Actually making my Docker Swarm highly available

One of the chores I've been pushing aside for a while has to do with the availability of some of the nodes in my Swarm cluster: individual nodes going down has repeatedly caused major downtime in a lot of the services I host.

My Swarm cluster utilizes Raspberry Pi 5s as compute nodes, and my Ampere server as the mothership. There are a few issues:

  1. Some of the Raspberry Pis randomly become unresponsive (i.e. trying to SSH into one hangs). From Docker Swarm mode's perspective, the node is still "available", but tasks already running on it stop responding, and new tasks scheduled onto it get stuck in an assigning state as well.
  2. NginxProxyManager (NPM) is currently configured to point every subdomain it proxies at my ampere server's IP. This means that if my main server goes down, all of my services go down.

Node Health Detection

As explained above, the symptom is that a node goes unresponsive: I can still ping the host, but anything that actually exercises it, like attempting to SSH in, fails:

ampere:~/$ ssh fleet1
ssh: connect to host fleet1 port 22: Connection refused
ampere:~/$ ping fleet1
PING fleet1.localdomain (192.168.1.14) 56(84) bytes of data.
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=1 ttl=64 time=0.203 ms
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=3 ttl=64 time=6.20 ms
--- fleet1.localdomain ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2014ms
rtt min/avg/max/mdev = 0.164/2.189/6.200/2.836 ms

ping still works and is therefore not a good health test; the failing SSH connection is the better signal

In this case the host is appropriately marked as Down by Docker, but there are cases where this doesn't happen.

ID                            HOSTNAME     STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
hx3u1keow4qjr16gxjokps5yx *   ampere       Ready     Active         Reachable        28.4.0
wjp82tvbuz2r65ls3scon2j7x     fleet0       Ready     Active                          28.4.0
ekaur1sizw2gvma6p16drx8bq     fleet1       Down      Active                          28.1.1

With a little bit of help from AI, I found that netcat is a good way to test application-layer availability:

ampere:~/$ nc -z -w 5 fleet1 22
ampere:~/$ nc -z -w 5 fleet2 22
Connection to fleet2 (192.168.1.128) 22 port [tcp/ssh] succeeded!

nc (netcat) against the unavailable host produces no output, while a working host results in a success message

Using this, I created a script in Portainer as a Config file, and a new Stack to run that script at a set interval:

#!/bin/bash

# Get a list of all worker nodes
NODES=$(docker node ls --format "{{.Hostname}}" --filter "role=worker")

for NODE in $NODES
do
  # Check if the node is reachable with a short timeout
  if ! nc -z -w 5 $NODE 22 > /dev/null 2>&1
  then
    echo "Node $NODE is unresponsive. Draining..."
    # If the node is unresponsive, drain it
    docker node update --availability drain $NODE
  else
    echo "Node $NODE is healthy. Checking availability..."
    # Check if the node is already drained
    AVAILABILITY=$(docker node inspect -f '{{.Spec.Availability}}' $NODE)
    if [ "$AVAILABILITY" == "drain" ]
    then
      echo "Node $NODE was previously drained. Setting it back to active."
      docker node update --availability active $NODE
    fi
  fi
done

The script, stored as a Docker config named availability
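If you prefer the CLI over Portainer's UI, the config can also be created directly on a manager node; a minimal sketch, assuming the script above is saved locally as availability.sh (the filename is just an example):

ampere:~/$ docker config create availability ./availability.sh
ampere:~/$ docker config ls --filter name=availability

The stack that actually runs it then just references that config: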

configs:
  availability:
    external: true

services:
  availability:
    image: docker:24-cli # Use a lightweight image with the docker CLI
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: sh /availability
    configs:
      - source: availability
        target: /availability
        mode: 0755
    deploy:
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        delay: 1m

Docker swarm stack referencing the shell script
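For reference, deploying this stack from the CLI instead of Portainer would look something like the following (the compose file and stack names here are placeholders):

ampere:~/$ docker stack deploy -c availability-stack.yml availability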

After deploying this stack, I checked the logs for one of the finished containers, and saw it working in action.

Node fleet0 is healthy. Checking availability...
Node fleet1 is unresponsive. Draining...
fleet1
Node fleet2 is healthy. Checking availability...
Node fleet3 is healthy. Checking availability...
Node fleet4 is healthy. Checking availability...
Node fleet5 is healthy. Checking availability...
Node nginxproxy is healthy. Checking availability...

Container logs

Rather than relying on a cron job, I decided to keep everything native to Docker/Portainer: the container runs the script once and exits, and the restart policy's one-minute delay effectively turns it into a once-per-minute loop.
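For comparison, the cron-based approach I decided against would have been a one-liner in the manager's crontab (the script path here is hypothetical):

# Run the drain/reactivate check every minute
* * * * * /usr/local/bin/availability.sh >> /var/log/availability.log 2>&1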

Theoretically I'd like to automatically power-cycle unhealthy devices by invoking UniFi APIs directly, but as a dev who builds production software and hardware that self-reboots upon failure detection, I know there's a whole bunch of things to consider: reboot attempt limits, timeouts, guardrails to prevent power-cycle boot loops, and so on. So for now a good alternative might be integrating with a push notification service like ntfy, which also has a self-hosting option.
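As a rough idea of what that could look like, the drain branch of the health check script could fire a notification with a single HTTP call to ntfy. A sketch, assuming a reachable ntfy instance at ntfy.example.com with a topic named swarm-alerts, and assuming curl is available in whatever image runs the script:

  if ! nc -z -w 5 $NODE 22 > /dev/null 2>&1
  then
    echo "Node $NODE is unresponsive. Draining..."
    docker node update --availability drain $NODE
    # Notify via ntfy: the POST body becomes the message, the Title header the notification title
    curl -s -H "Title: Swarm node drained" \
      -d "$NODE stopped answering on port 22 and has been drained" \
      https://ntfy.example.com/swarm-alerts
  fi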

ntfy: Send push notifications to your phone via PUT/POST

I'll set this system up at some point in the future.

Routing Availability

The plan here is simple: Create an overlay network for all of my docker services that I want behind the reverse proxy, and modify my NginxProxyManager proxy hosts to point to the docker swarm mode service names.

Here, for instance, is the reverse proxy host configuration for this blog: it points directly at my ampere host's IP (my manager node).

First, I create a new network through Docker CLI:

ampere:~/$ docker network create --driver overlay proxy_network

I do this through the CLI because Portainer, for some reason, doesn't let me create overlay networks through its GUI.
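To double-check that it really was created as a swarm-scoped overlay network, something like this works and should print "overlay swarm":

ampere:~/$ docker network inspect proxy_network --format '{{.Driver}} {{.Scope}}'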

Next, I update my NPM stack so that it's deployed onto this overlay network:

services:
  app:
    image: 'jc21/nginx-proxy-manager:2.12.2'
    restart: unless-stopped
    ports:
      - '80:80'
      - '81:81'
      - '443:443'
    networks:
      - proxy_network
                              ... redacted ...
    
networks:
  proxy_network:
    external: true

Now I just need to start migrating all my services to:

  1. Stop publishing a public port
  2. Deploy to the same overlay network (assuming I want them internet-facing behind the reverse proxy)

The following is this blog's updated stack (with the critical stuff redacted, obviously):

version: '3.1'

services:
  ghost:
    image: ghost:5-alpine
    networks:
      - proxy_network
    environment:
      database__connection__host: ghost_service_ghost-db
      database__connection__port: 3306

  ghost-db:
    image: mysql:8.0
    restart: always
    networks:
      - proxy_network

networks:
  proxy_network:
    external: true

As you can see, neither service publishes a public port anymore, and both are deployed onto proxy_network. You can also see that I had to update the ghost service's database connection to use the service name instead of a hostname + published port combination.

Lastly, I updated the NPM proxy host configuration to use the service name and Ghost's internal port.
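Concretely, the proxy host form ends up with values roughly like these (the service name assumes the stack was deployed as ghost_service, and 2368 is Ghost's default internal port):

Scheme:              http
Forward Hostname/IP: ghost_service_ghost
Forward Port:        2368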

Conclusion

With the above two chores complete, my network is much more resilient to outages or degradation due to hosts randomly going down.

A few more things that I still need to do:

  1. My port forwarding rule points at a specific IP assigned to one of my hosts. That host is still a single point of failure, because it going down would cause an outage at the port forwarding level. I'll be exploring virtual IP solutions to fix this.
  2. I will do a bit more fiddling around with ntfy and see how that works!