Moving from GitHub Pages to Hetzner and Caddy with Analytics
published:
by Harshvardhan J. Pandit
is part of: harshp.com
Tags: analytics, back-end, DPVCG, front-end, server, web-dev, website
Goal: move the websites for dpvcg.org, dev.dpvcg.org, and harshp.com to a custom server running Linux, and set up automatic updates and rudimentary analytics.
Server Provider - Hetzner
I went with a Hetzner CAX21 (4 vCPU Ampere ARM64, 8GB RAM, 80GB SSD, 20TB bandwidth/month) with an IPv4 address, running Debian 12, for about €8/month. Painless setup - everything was easy. Hetzner asked for ssh keys and added them automatically to the server, so I could ssh in directly once the server was ready (which it was in a minute).
Hetzner created the Debian server in a barebones manner, so the first time I logged in, I was root. To avoid the usual security pitfalls, it's better to first create a user for myself, and then give it sudo permissions so I can install packages and do other admin stuff:
adduser harsh
usermod -aG sudo harsh
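To be able to ssh in directly as this user, the authorised keys can be copied over from root - a small sketch, assuming the default OpenSSH paths:
# copy root's authorised ssh keys to the new user (default paths assumed)
rsync --archive --chown=harsh:harsh ~/.ssh /home/harsh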
Web Server - Caddy
I’ve had experience with Apache2 and Nginx before. Apache2 is a beast - super powerful and all-capable, but very verbose and complex to use and maintain. I’ve also forgotten everything about how to use it, so it would have meant spending time relearning. With Nginx, the configuration aspects are much easier, and there is a wide community that has built things with it - so that was the ‘safe choice’ option. But since this is an activity during the Christmas holidays, I decided to try out Caddy, intrigued by its promise of simplicity and its batteries-included approach.
# assuming sudo
apt-get install caddy
systemctl start caddy
systemctl status caddy # everything looks okay
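The Debian package registers Caddy as a systemd service; if it isn't already set to start on boot, that's one more command:
# start caddy automatically on boot (may already be enabled by the package)
systemctl enable caddy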
Caddy config is much smaller and simpler than that of Nginx or Apache. The materials online can be tricky to navigate, though, as configs that do exactly what you want are sparse. But the approach of building config files is made much nicer through commands that take care of most of the work.
The basic config file preinstalled is at /etc/caddy/Caddyfile. To enable modular configs, we simply create a directory called Caddyfile.d and import configs from it.
# /etc/caddy/Caddyfile
import Caddyfile.d/*.caddyfile
Then in Caddyfile.d we add specific files for each config. I have three websites that need to be served - all are static files, so there is no process running (so far), which means we don't need a reverse proxy here, just the file server.
# domain name
harshp.com {
    # root of website maps to the following path
    root * /usr/share/caddy/harshp.com
    # this is for setting CORS options
    @cors_preflight {
        method OPTIONS
    }
    respond @cors_preflight 204
    header {
        # this is where I store my media files
        Access-Control-Allow-Origin https://media.harshp.com
        Access-Control-Allow-Methods GET,POST,OPTIONS,HEAD,PATCH,PUT,DELETE
        Access-Control-Allow-Headers User-Agent,Content-Type,X-Api-Key
        Access-Control-Max-Age 86400
    }
    # saves space, is faster
    encode gzip
    # try the url with a .html suffix so we don't have to type it
    try_files {path} {path}.html
    # serve 404s using this file
    handle_errors {
        rewrite * /404.html
        file_server
    }
    # don't expose the .git folder
    file_server {
        hide .git
    }
    # this is where the access logs go (including errors)
    log {
        output file /var/log/caddy/harshp.com.net-access.log
    }
}
# subdomain www
www.harshp.com {
    # redirect it to the main site
    redir https://harshp.com{path}
}
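After editing configs, Caddy can check the syntax and reload without downtime - assuming the default config location:
# check the Caddyfile for errors, then apply it
caddy validate --config /etc/caddy/Caddyfile
systemctl reload caddy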
The config for dev.dpvcg.org requires some redirections based on path, so that if we don't specify a version number, it automatically goes to the latest version under development. The Caddy syntax for this is really simple:
dev.dpvcg.org {
    root * /usr/share/caddy/dev.dpvcg.org
    # redirect <from> <to>
    # note: {path} includes the leading slash
    redir /ai* /2.1-dev{path}
    redir /diagrams* /2.1-dev{path}
    redir /dpv* /2.1-dev{path}
    redir /examples* /2.1-dev{path}
    redir /justifications* /2.1-dev{path}
    redir /legal* /2.1-dev{path}
    redir /loc* /2.1-dev{path}
    redir /pd* /2.1-dev{path}
    redir /risk* /2.1-dev{path}
    redir /search.html* /2.1-dev{path}
    redir /sector* /2.1-dev{path}
    redir /tech* /2.1-dev{path}
    encode gzip
    file_server {
        hide .git
    }
    log {
        output file /var/log/caddy/dev.dpvcg.org.net-access.log
    }
}
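A quick way to check that a redirect behaves as expected is to look at the Location header with curl - for example, for the dpv path:
# should print the 2.1-dev location the request gets redirected to
curl -sI https://dev.dpvcg.org/dpv | grep -i location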
Other stuff I have yet to figure out: what to put in the config so that URLs ending in a forward slash, like path/, are redirected to path.html, because that's what I have on my website. Maybe I should instead follow the general convention of path/index.html, which Caddy supports by default.
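One possible approach (untested) is a path_regexp matcher that captures everything before the trailing slash and redirects to it with .html appended:
# match urls ending in a slash, capturing the part before it
@trailing_slash path_regexp trail ^(/.+)/$
# redirect path/ to path.html using the captured group
redir @trailing_slash {re.trail.1}.html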
Yet other stuff that will be needed in the future is content negotiation, so that we can request RDF formats specifically. There is a Caddy plugin called caddy-conneg that seems to provide this. But to install plugins, I have to effectively rebuild Caddy, or - a simpler way - go to the Caddy plugins page, select the plugin, and download a precompiled binary (caveat emptor). Then I replace the binary at /usr/bin/caddy with it, and we're apparently good to go. I need to test whether this works later.
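The supported route for building Caddy with plugins is the xcaddy tool; the command would look something like the below (the plugin's module path here is illustrative - check the plugin's repository for the real one):
# build a caddy binary with a plugin compiled in
# (replace the module path with the plugin's actual repository)
xcaddy build --with github.com/example/caddy-conneg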
Serving Websites
To serve static websites, they must first be synced onto the server. Currently, these websites are stored on GitHub, for example as https://github.com/coolharsh55/dpvcg.org, and using git we can clone them and keep them in sync on the server. To avoid having to do this manually every time there is an update (and then sometimes forgetting to sync), it's better to automate the local git repo so that it always stays in sync with the remote repo.
To do this, first we need to clone the git repo locally. These folders require specific permissions, as Caddy should also be able to read them in order to serve them. Instead of individually managing permissions (for the local user, for the Caddy user), it's better to create a shared group (let's call it dev) and use it to give access to things.
groupadd dev
usermod -aG dev harsh
usermod -aG dev caddy
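One gotcha: group membership only applies to new sessions, so my shell needs a re-login (or newgrp) and the caddy service a restart before the change takes effect:
# refresh group membership for the current shell and for caddy
newgrp dev
systemctl restart caddy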
Then we need to give this group access to the folder where all files will be stored - we can either do this on a folder-by-folder basis, or give access to the parent folder. I opt for the parent folder approach since it's convenient and makes it easier to add more stuff later. For Caddy, the default path for files to be served is /usr/share/caddy, so we use that:
chown -R harsh:dev /usr/share/caddy # make user and group owner
chmod 0755 /usr/share/caddy # enable all users to read, only me to modify
git clone <repo> /usr/share/caddy/<folder>
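To confirm that Caddy can actually read the files, the same listing can be run as the caddy user:
# should list the repo contents without a permission error
sudo -u caddy ls /usr/share/caddy/<folder>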
Once this is done, we set up an automated script, called every 15 minutes using cron, to update all the folders at this path if they are git repos.
# Cron script to run git-repo-update.sh every 15 minutes
crontab -e # and then the below script
*/15 * * * * /usr/bin/bash /home/harsh/bin/git-repo-update.sh > /dev/null 2>&1
# /home/harsh/bin/git-repo-update.sh and then the below contents
#!/usr/bin/env bash
# update every git repo under the web root
for subfolder in /usr/share/caddy/*/; do
    if [ -d "$subfolder/.git" ]; then
        # $subfolder is already the full path, so cd straight into it
        cd "$subfolder" || continue
        git pull --rebase > /dev/null 2>&1
    fi
done
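Before relying on cron, the script can be run once by hand to check that it behaves:
bash /home/harsh/bin/git-repo-update.sh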
Analytics
Matomo is a full-featured and privacy-conscious solution that can be self-hosted. However, it requires setting up a process (to run Matomo) and a database (to store analytics data), as well as integration into each of the pages being served. Since some of the websites I'm serving are 'mirror' sites and others are 'official' websites, the setup should make it minimal effort to move them. Therefore, I went with GoAccess, a lightweight app that analyses server logs - and which has built-in support for Caddy's JSON log format.
The Caddy log files are stored by default (unless the config has a different path) at /var/log/caddy. So we simply need to point goaccess at the log file and generate an HTML output. GoAccess also has a 'webserver' mode where it analyses traffic in real time, but I don't need that level of detail here. So instead, I set up a cron job where the analytics are run every 15 minutes.
apt-get install goaccess
vim /home/harsh/bin/git-repo-update.sh # and then add the below contents
# within the for -> if section, after the git pull, add the goaccess call
# ($subfolder is the full path, so extract just the domain name for the log file)
domain=$(basename "$subfolder")
goaccess /var/log/caddy/"$domain".net-access.log -o ./analytics.html --log-format=CADDY
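Putting the git sync and the analytics generation together, the full script would look something like this - a sketch assuming the <domain>.net-access.log naming used in the Caddy configs above:
#!/usr/bin/env bash
# update each git repo under the web root and regenerate its analytics page
for subfolder in /usr/share/caddy/*/; do
    if [ -d "$subfolder/.git" ]; then
        cd "$subfolder" || continue
        git pull --rebase > /dev/null 2>&1
        # log files are named <domain>.net-access.log as per the Caddy configs
        domain=$(basename "$subfolder")
        goaccess /var/log/caddy/"$domain".net-access.log -o ./analytics.html --log-format=CADDY
    fi
done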