Actually making my Docker Swarm highly available

One of the chores I've been pushing aside for a while has to do with the spotty availability of some of the nodes in my Swarm cluster, which has led to major downtime in a lot of the services I host.
My Swarm cluster utilizes Raspberry Pi 5s as compute nodes, and my Ampere server as the mothership. There are a few issues:
- Some of the Raspberry Pis randomly become unresponsive (i.e. trying to SSH into them hangs). From Docker Swarm Mode's perspective, the node is still "available", but tasks already assigned to it are no longer responsive, and new tasks scheduled onto it get stuck in the assigning state as well (see the example right after this list).
- NginxProxyManager is currently configured to point at my Ampere server's IP when resolving subdomains. This means that if my main server goes down, all of my services go down with it.
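To make the first point concrete: when this happens, docker node ls still reports the node as Ready, while listing a service's tasks shows them stuck on that node (my_service here is just a placeholder name):
ampere:~/$ docker service ps my_service --format "{{.Name}} {{.Node}} {{.CurrentState}}"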
Node Health Detection
As explained above, the symptom (and problem statement) is that a node goes unresponsive: I can still ping the host, but anything at the application layer, such as attempting to SSH in, fails:
ampere:~/$ ssh fleet1
ssh: connect to host fleet1 port 22: Connection refused
ampere:~/$ ping fleet1
PING fleet1.localdomain (192.168.1.14) 56(84) bytes of data.
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=1 ttl=64 time=0.203 ms
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=2 ttl=64 time=0.164 ms
64 bytes from fleet1.localdomain (192.168.1.14): icmp_seq=3 ttl=64 time=6.20 ms
--- fleet1.localdomain ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2014ms
rtt min/avg/max/mdev = 0.164/2.189/6.200/2.836 ms
ping still works and is thus not a good health test; SSH failing is the more telling signal
In this case, the host is appropriately marked as Down in Docker, but there are cases where this doesn't happen.
ID                          HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
hx3u1keow4qjr16gxjokps5yx * ampere     Ready    Active         Reachable        28.4.0
wjp82tvbuz2r65ls3scon2j7x   fleet0     Ready    Active                          28.4.0
ekaur1sizw2gvma6p16drx8bq   fleet1     Down     Active                          28.1.1
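If Docker's own view were trustworthy, I could script against it directly; the node status field is what docker node ls renders as Ready/Down:
ampere:~/$ docker node inspect -f '{{ .Status.State }}' fleet1
But since a hung node sometimes keeps reporting ready, I wanted a check that's independent of what Swarm itself reports.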
With a little bit of help from AI, I found that netcat is a good test for application-layer availability:
ampere:~/$ nc -z -w 5 fleet1 22
ampere:~/$ nc -z -w 5 fleet2 22
Connection to fleet2 (192.168.1.128) 22 port [tcp/ssh] succeeded!
nc (netcat) against the unavailable host produces no output, while a working host results in a success message
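What a script actually cares about is the exit code rather than the output: nc -z exits 0 when the port accepts a connection and non-zero otherwise, which is easy to confirm:
ampere:~/$ nc -z -w 5 fleet1 22; echo $?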
Using this, I created a script in Portainer as a Config, and a new Stack to run that script at a set interval:
#!/bin/bash

# Get a list of all worker nodes
NODES=$(docker node ls --format "{{.Hostname}}" --filter "role=worker")

for NODE in $NODES
do
  # Check if the node is reachable with a short timeout
  if ! nc -z -w 5 $NODE 22 > /dev/null 2>&1
  then
    echo "Node $NODE is unresponsive. Draining..."
    # If the node is unresponsive, drain it
    docker node update --availability drain $NODE
  else
    echo "Node $NODE is healthy. Checking availability..."
    # Check if the node is already drained
    AVAILABILITY=$(docker node inspect -f '{{.Spec.Availability}}' $NODE)
    if [ "$AVAILABILITY" == "drain" ]
    then
      echo "Node $NODE was previously drained. Setting it back to active."
      docker node update --availability active $NODE
    fi
  fi
done
The health-check script, stored as a Config named availability
configs:
  availability:
    external: true

services:
  availability:
    image: docker:24-cli # Use a lightweight image with the docker CLI
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    command: sh /availability
    configs:
      - source: availability
        target: /availability
        mode: 0755
    deploy:
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        delay: 1m
Docker swarm stack referencing the shell script
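For anyone not using Portainer, the same setup can be done from the CLI; the file names below are just placeholders for wherever you saved the script and the stack file:
ampere:~/$ docker config create availability ./availability.sh
ampere:~/$ docker stack deploy -c availability-stack.yml availability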
After deploying this stack, I checked the logs of one of the finished containers and saw it in action.
Node fleet0 is healthy. Checking availability...
Node fleet1 is unresponsive. Draining...
fleet1
Node fleet2 is healthy. Checking availability...
Node fleet3 is healthy. Checking availability...
Node fleet4 is healthy. Checking availability...
Node fleet5 is healthy. Checking availability...
Node nginxproxy is healthy. Checking availability...
Container logs
Rather than relying on a cron job, I've decided to keep everything native to Docker/Portainer; the stack's restart policy effectively re-runs the script once every minute.
Theoretically I want to automatically power-cycle unhealthy devices by directly invoking UniFi APIs, but as a dev who builds production software and hardware that self-reboots upon failure detection, I know there's a whole bunch of things to consider: retry limits, timeouts, guardrails to prevent power-cycle boot loops, etc. So for now I think a good alternative might be integrating with a push notification service like ntfy, which seems to also have self-hosting options.
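The notification itself should be trivial: ntfy topics accept a plain HTTP POST, so the drain branch of the script could do something roughly like the following (the topic name and the public ntfy.sh instance are placeholders, and this assumes curl is available in the container image; a self-hosted instance would use its own URL):
# Hypothetical: fire a push notification when a node gets drained
curl -d "Node $NODE is unresponsive, draining it" https://ntfy.sh/swarm-alerts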

I'll set this system up at some point in the future.
Routing Availability
The plan here is simple: create an overlay network for all of the Docker services that I want behind the reverse proxy, and modify my NginxProxyManager proxy hosts to point to the Docker Swarm service names.
Here is, for instance, my reverse proxy host configuration for this blog. You can see that it's directly pointing to my ampere host (my manager node).

First, I create a new network through Docker CLI:
ampere:~/$ docker network create --driver overlay proxy_network
I do this through the CLI because Portainer, for some reason, doesn't let me create overlay networks through its GUI.
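Either way, it's easy to confirm the network exists and is swarm-scoped:
ampere:~/$ docker network ls --filter driver=overlay
ampere:~/$ docker network inspect proxy_network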
Next, I update my NPM stack to be deployed to this overlay network:
services:
  app:
    image: 'jc21/nginx-proxy-manager:2.12.2'
    restart: unless-stopped
    ports:
      - '80:80'
      - '81:81'
      - '443:443'
    networks:
      - proxy_network

... redacted ...

networks:
  proxy_network:
    external: true
Now I just need to start migrating all my services to:
- Stop publishing a public port
- Deploy to the same overlay network (assuming I want them to be internet-facing behind the reverse proxy)
The following is an example of this blog's stack after the update (obviously with critical stuff redacted):
version: '3.1'

services:
  ghost:
    image: ghost:5-alpine
    networks:
      - proxy_network
    environment:
      database__connection__host: ghost_service_ghost-db
      database__connection__port: 3306

  ghost-db:
    image: mysql:8.0
    restart: always
    networks:
      - proxy_network

networks:
  proxy_network:
    external: true
As you can see, neither service publishes a public port anymore, and both are deployed to the proxy_network. You can also see that I had to update the ghost service's database connection to use the service name instead of a hostname + published port combination.
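Swarm prefixes service names with the stack name (which is presumably where ghost_service_ghost-db comes from), and the overlay network's built-in DNS resolves them. If in doubt about the exact strings, they can be listed:
ampere:~/$ docker service ls --format "{{.Name}}"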

Lastly, here is the updated NPM configuration, using the service name plus Ghost's internal port.
Conclusion
With the above two chores complete, my network is much more resilient to outages or degradation due to hosts randomly going down.
A few more things that I need to do:
- My port forwarding rule points to a specific IP assigned to one of my hosts. That host going down is still a single point of failure, because it will cause an outage at the port forwarding level. I'll be exploring virtual IP solutions to fix this.
- I will do a bit more fiddling around with ntfy and see how that works!