Moving from GitHub Pages to Hetzner and Caddy with Analytics
published:
by Harshvardhan J. Pandit
is part of: harshp.com
Tags: analytics, back-end, DPVCG, front-end, server, web-dev, website
Goal: move the websites for dpvcg.org, dev.dpvcg.org, and harshp.com to a custom server running Linux, and set up automatic updates and rudimentary analytics.
Server Provider - Hetzner
I went with a Hetzner CAX21 (4 vCPU Ampere ARM64, 8GB RAM, 80GB SSD, 20TB bandwidth/month) with an IPv4 address, running Debian 12, for about €8/month. Painless setup - everything was easy. Hetzner asked for ssh keys and added them automatically to the server, so I could ssh in directly once the server was ready (which it was in a minute).
Hetzner created the Debian server in a barebones manner, so the first time I logged in, I was root. To avoid the usual security pitfalls, it's better to first create a user for myself, and then give it sudo permissions so I can install packages and do other admin stuff:
adduser harsh
usermod -aG sudo harsh
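To be able to ssh in directly as this user, the authorised keys can be copied over from root - a small sketch, assuming the default OpenSSH paths:
# copy root's authorised ssh keys to the new user (default paths assumed)
rsync --archive --chown=harsh:harsh ~/.ssh /home/harsh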
Web Server - Caddy
I’ve had experience with Apache2 and Nginx before. Apache2 is a beast - super powerful and all-capable, but very verbose and complex to use and maintain. I’ve also forgotten everything about how to use it, so it would have meant spending time relearning. With Nginx, the configuration aspects are much easier, and there is a wide community that has built things with it - so that was the ‘safe choice’ option. But since this is an activity during the Christmas holidays, I decided to try out Caddy, intrigued by its promise of simplicity and its batteries-included approach.
# assuming sudo
apt-get install caddy
systemctl start caddy
systemctl status caddy # everything looks okay
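The Debian package registers Caddy as a systemd service; if it isn't already set to start on boot, that's one more command:
# start caddy automatically on boot (may already be enabled by the package)
systemctl enable caddy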
Caddy config is much smaller and simpler than that of Nginx or Apache. The materials online can be tricky to navigate, though, as configs that do exactly what you want are sparse. But the approach of building config files is made much nicer through commands that take care of most of the work.
The basic config file preinstalled is at /etc/caddy/Caddyfile. To enable modular configs, we simply create a directory called Caddyfile.d and import configs from it.
# /etc/caddy/Caddyfile
import Caddyfile.d/*.caddyfile
Then in Caddyfile.d we add specific files for each config. I have three websites that need to be served - all are static files, so there is no process running (so far), which means we don't need a reverse proxy here, just the file server.
# domain name
harshp.com {
    # root of website maps to the following path
    root * /usr/share/caddy/harshp.com
    # this is for setting CORS options
    @cors_preflight {
        method OPTIONS
    }
    respond @cors_preflight 204
    header {
        # this is where I store my media files
        Access-Control-Allow-Origin https://media.harshp.com
        Access-Control-Allow-Methods GET,POST,OPTIONS,HEAD,PATCH,PUT,DELETE
        Access-Control-Allow-Headers User-Agent,Content-Type,X-Api-Key
        Access-Control-Max-Age 86400
    }
    # saves space, is faster
    encode gzip
    # try the url with a .html suffix so we don't have to type it
    try_files {path} {path}.html
    # serve 404s using this file
    handle_errors {
        rewrite * /404.html
        file_server
    }
    # don't expose the .git folder
    file_server {
        hide .git
    }
    # this is where the access logs go (including errors)
    log {
        output file /var/log/caddy/harshp.com.net-access.log
    }
}
# subdomain www
www.harshp.com {
    # redirect it to the main site
    redir https://harshp.com{path}
}
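After editing configs, Caddy can check the syntax and reload without downtime - assuming the default config location:
# check the Caddyfile for errors, then apply it
caddy validate --config /etc/caddy/Caddyfile
systemctl reload caddy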
The config for dev.dpvcg.org requires some redirections based on path, so that if we don't specify a version number, it automatically goes to the latest version under development. The Caddy syntax for this is really simple:
dev.dpvcg.org {
    root * /usr/share/caddy/dev.dpvcg.org
    # redirect <from> <to>
    # note: {path} includes the leading slash
    redir /ai* /2.1-dev{path}
    redir /diagrams* /2.1-dev{path}
    redir /dpv* /2.1-dev{path}
    redir /examples* /2.1-dev{path}
    redir /justifications* /2.1-dev{path}
    redir /legal* /2.1-dev{path}
    redir /loc* /2.1-dev{path}
    redir /pd* /2.1-dev{path}
    redir /risk* /2.1-dev{path}
    redir /search.html* /2.1-dev{path}
    redir /sector* /2.1-dev{path}
    redir /tech* /2.1-dev{path}
    encode gzip
    file_server {
        hide .git
    }
    log {
        output file /var/log/caddy/dev.dpvcg.org.net-access.log
    }
}
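A quick way to check that a redirect behaves as expected is to look at the Location header with curl - for example, for the dpv path:
# should print the 2.1-dev location the request gets redirected to
curl -sI https://dev.dpvcg.org/dpv | grep -i location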
Other stuff I have yet to figure out: what to put in the config so that URLs ending in a forward slash, like path/, are redirected to path.html, because that's what I have on my website. Maybe I should instead follow the general convention of path/index.html, which Caddy supports by default.
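One possible approach (untested) is a path_regexp matcher that captures everything before the trailing slash and redirects to it with .html appended:
# match urls ending in a slash, capturing the part before it
@trailing_slash path_regexp trail ^(/.+)/$
# redirect path/ to path.html using the captured group
redir @trailing_slash {re.trail.1}.html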
Yet other stuff that will be needed in the future is content negotiation, so that we can request RDF formats specifically. There is a Caddy plugin called caddy-conneg that seems to provide this. But to install plugins, I have to effectively rebuild Caddy, or - a simpler way - go to the Caddy plugins page, select the plugin, and download a precompiled binary (caveat emptor). Then I replace the binary at /usr/bin/caddy with it, and we're apparently good to go. I need to test whether this works later.
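The supported route for building Caddy with plugins is the xcaddy tool; the command would look something like the below (the plugin's module path here is illustrative - check the plugin's repository for the real one):
# build a caddy binary with a plugin compiled in
# (replace the module path with the plugin's actual repository)
xcaddy build --with github.com/example/caddy-conneg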
Serving Websites
To serve static websites, they must first be synced onto the server. Currently, these websites are stored on GitHub, for example as https://github.com/coolharsh55/dpvcg.org, and using git we can clone them and keep them in sync on the server. To avoid having to do this manually every time there is an update (and then sometimes forgetting to sync), it's better to automate the local git repo so that it always stays in sync with the remote repo.
To do this, first we need to clone the git repo locally. These folders require specific permissions, as Caddy should also be able to read them in order to serve them. Instead of individually managing permissions (for the local user, for the Caddy user), it's better to create a shared group (let's call it dev) and use it to give access to things.
groupadd dev
usermod -aG dev harsh
usermod -aG dev caddy
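One gotcha: group membership only applies to new sessions, so my shell needs a re-login (or newgrp) and the caddy service a restart before the change takes effect:
# refresh group membership for the current shell and for caddy
newgrp dev
systemctl restart caddy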
Then we need to give this group access to the folder where all files will be stored - we can either do this on a folder-by-folder basis, or give access to the parent folder. I opt for the parent folder approach since it's convenient and makes it easier to add more stuff later. For Caddy, the default path for files to be served is /usr/share/caddy, so we use that:
chown -R harsh:dev /usr/share/caddy # make user and group owner
chmod 0755 /usr/share/caddy # enable all users to read, only me to modify
git clone <repo> /usr/share/caddy/<folder>
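To confirm that Caddy can actually read the files, the same listing can be run as the caddy user:
# should list the repo contents without a permission error
sudo -u caddy ls /usr/share/caddy/<folder>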
Once this is done, we set up an automated script, called every 15 minutes using cron, to update all the folders at this path if they are git repos.
# Cron script to run git-repo-update.sh every 15 minutes
crontab -e # and then the below script
*/15 * * * * /usr/bin/bash /home/harsh/bin/git-repo-update.sh > /dev/null 2>&1
# /home/harsh/bin/git-repo-update.sh and then the below contents
#!/usr/bin/env bash
# update every git repo under the web root
for subfolder in /usr/share/caddy/*/; do
    if [ -d "$subfolder/.git" ]; then
        # $subfolder is already the full path, so cd straight into it
        cd "$subfolder" || continue
        git pull --rebase > /dev/null 2>&1
    fi
done
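Before relying on cron, the script can be run once by hand to check that it behaves:
bash /home/harsh/bin/git-repo-update.sh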
Analytics
Matomo is a full-featured and privacy-conscious solution that can be self-hosted. However, it requires setting up a process (to run Matomo) and a database (to store analytics data), as well as integration into each of the pages being served. Since some of the websites I'm serving are 'mirror' sites and others are 'official' websites, the setup should make it minimal effort to move them. Therefore, I went with GoAccess, a lightweight app that analyses server logs - and which has built-in support for Caddy's JSON log format.
The Caddy log files are stored by default (unless the config has a different path) at /var/log/caddy. So we simply need to point goaccess at the log file and generate an HTML output. GoAccess also has a 'webserver' mode where it analyses traffic in real time, but I don't need that level of detail here. So instead, I set up a cron job where the analytics are run every 15 minutes.
apt-get install goaccess
vim /home/harsh/bin/git-repo-update.sh # and then add the below contents
# within the for -> if section, after the git pull, add the goaccess call
# ($subfolder is the full path, so extract just the domain name for the log file)
domain=$(basename "$subfolder")
goaccess /var/log/caddy/"$domain".net-access.log -o ./analytics.html --log-format=CADDY
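Putting the git sync and the analytics generation together, the full script would look something like this - a sketch assuming the <domain>.net-access.log naming used in the Caddy configs above:
#!/usr/bin/env bash
# update each git repo under the web root and regenerate its analytics page
for subfolder in /usr/share/caddy/*/; do
    if [ -d "$subfolder/.git" ]; then
        cd "$subfolder" || continue
        git pull --rebase > /dev/null 2>&1
        # log files are named <domain>.net-access.log as per the Caddy configs
        domain=$(basename "$subfolder")
        goaccess /var/log/caddy/"$domain".net-access.log -o ./analytics.html --log-format=CADDY
    fi
done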