A small distributed system

For the curious: the system that hosts this Website

I’ll be posting information here about the distributed system that runs this and other sites. The main reason for running my own stuff is to learn: so many things can, and will, go wrong. For many years I’ve argued that we shouldn’t be programming distributed systems but composing them. This means gluing pieces of software, hardware, and actual machines together into a system. That shifts the emphasis to the glue, which consists of scripts, cron jobs, and whatnot. It also requires being able to switch stuff off and, more importantly, back on. Remotely. I will gradually add material, hopefully inspiring others to follow similar steps and avoid the mistakes I made.

With the advent of AI, I’ve been handing my scripts to Claude for assessment and improvement. It has led to real improvements and, to me, illustrates a way to use such powerful tools. One important thing to keep in mind: I have the idea and the overall solution; I give Claude an initial script that does most of the work, and then we take it from there. It’s one of the few ways to make sure you keep understanding what’s going on.

For starters, here’s a story of the most difficult part of the system: cloning a main server to a warm standby. Thanks to the interaction with Claude, the initial script turned into something truly useful.

rsync -a Is Not “Clone a Machine”

Notes from building a warm standby that I might actually need one day — and everything that only showed up once it had to run for real.

The premise

I keep a second server at a different location as a warm standby. The idea is simple, and on paper it takes two steps. Design: a main server (I’ll call it zappa) runs my services; a standby (coltrane) at another site is a faithful clone; if zappa dies I switch DNS to coltrane. Implementation: coltrane wakes once a day and the whole filesystem is rsynced across, while a small always-on machine (horton) at the standby site captures the fast-changing bits in between. Done.

It was not done. What follows is the gap between that two-step picture and what it actually took — because the interesting part is precisely the things that no amount of up-front design surfaced, and that only experimentation did. (Host names are pseudonyms; I’ve stripped domains, addresses, ports, UUIDs, and the service inventory. More on why at the end. The code is sanitized and uses placeholders such as $STANDBY and $PEER.)

The shape of the thing

  • zappa — the main server. Runs a website that generates downloadable copies of a book on request, a self-hosted cloud, mail, and the usual.
  • coltrane — the standby, at a second site. Powered off most of the day; it wakes once a day, pulls a full clone from zappa, and goes back to sleep.
  • horton — the peer, a small, always-on machine at the standby site. It captures the volatile data on a tight cadence (downloaded copies every few minutes, incoming mail, database dumps), because coltrane is asleep and can’t.
  • Failover is manual: switch DNS, bring coltrane up as primary.

This is disaster recovery, not high availability — a human in the loop, minutes-to-hours of downtime accepted in exchange for something I can fully understand and repair myself. That framing matters, and I’ll come back to it.

1. A full-filesystem copy is not a consistent copy

The naive rsync -a --delete / of a running system copies a live database file-by-file while the database is writing to it. The result is a torn, possibly unrestorable copy. The fix is to capture logical state — a dump — and let it ride along with the file sync:

# Capture each database as a consistent logical snapshot, not a byte copy of a
# file the database is still writing to. --single-transaction gives a consistent
# view of InnoDB tables with no locking and no downtime.
set -o pipefail   # a failed mysqldump must not be masked by a successful gzip
for db in app_db cloud_db; do
    tmp="$DUMPDIR/$db.sql.gz.tmp"
    if mysqldump --single-transaction --quick --routines --triggers "$db" \
            | gzip > "$tmp"; then
        mv "$tmp" "$DUMPDIR/$db.sql.gz"   # atomic: a reader never sees a half-written dump
    else
        echo "dump of $db failed; keeping the previous dump" >&2
        rm -f "$tmp"
    fi
done

The self-hosted cloud drove this home because it isn’t “files.” It’s a data directory plus a database plus config, and they have to agree. Restoring an old database against newer files — or vice versa — produces dangling references the moment a client reconnects. Worse, a client that synced past the restored state can read the rolled-back server as “files were deleted” and propagate those deletions. The clean answer was a choice: capture the database and files together and consistently, or simply stall the cloud during failover (maintenance mode) so no client ever syncs against a stale server. A paused server is safe; a rolled-back one is not.
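A minimal sketch of the first option, under the assumption that the cloud software offers some maintenance toggle: pause the cloud, capture database and files in one window, and always un-pause afterwards. The function cloud_maintenance is a hypothetical placeholder (the real command depends on the cloud software, which is not named here), and with_cloud_paused is a name invented for this illustration.

```shell
# HYPOTHETICAL hook: replace the body with the real maintenance-mode command
# of your cloud software. As written it only prints what it would do.
cloud_maintenance() {
    echo "maintenance $1"
}

# Run a capture command while the cloud is paused, and always un-pause
# afterwards, even when the capture itself fails.
with_cloud_paused() {
    local rc=0
    cloud_maintenance on || return 1      # could not pause: do not capture at all
    "$@" || rc=$?                         # the capture (database dump + file sync)
    if ! cloud_maintenance off; then
        echo "WARNING: cloud is still in maintenance mode" >&2
        [ "$rc" -ne 0 ] || rc=1           # keep the capture's own error code if it had one
    fi
    return "$rc"
}
```

It would be called as `with_cloud_paused capture_cloud`, where capture_cloud stands for whatever dumps the database and syncs the data directory. The price is a short pause for clients during each capture, which fits a setup that already accepts a human in the loop.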

Lesson. Stateful services need consistent, application-aware capture — not a byte copy of whatever happened to be on disk at 3 a.m.
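A dump is only useful if it is complete. As a hedged sketch, the check below tests three things before a dump is trusted: the file is non-empty, the gzip stream is intact, and the SQL ends with the "-- Dump completed" comment that mysqldump writes as its last line (this marker is absent if comments were disabled with --skip-comments). The function name verify_dump is an assumption made for this example.

```shell
# Return 0 only if the gzipped dump looks complete.
verify_dump() {
    local dump="$1"
    [ -s "$dump" ] || { echo "missing or empty: $dump" >&2; return 1; }
    gzip -t "$dump" 2>/dev/null || { echo "corrupt gzip: $dump" >&2; return 1; }
    # mysqldump finishes a successful dump with a "-- Dump completed" comment.
    gzip -dc "$dump" | tail -n 5 | grep -q '^-- Dump completed' \
        || { echo "truncated dump: $dump" >&2; return 1; }
}
```

This says nothing about whether the dump restores cleanly; only a rehearsed restore answers that.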

2. Where --delete meets reality

A mirror needs --delete, and --delete interacts with everything else in ways I kept rediscovering. Excluded files protect directories from deletion, so a directory that vanishes on the source but still holds excluded files can’t be emptied (“cannot delete non-empty directory”). Live files change mid-transfer: the systemd journal is written continuously, so rsync copies it, the checksum no longer matches, and it reports “failed verification — update discarded” — and deleting it doesn’t help because it’s regenerated instantly. And --delete will happily walk into a mount point the target has but the source lacks, deleting through it into another machine entirely.

# Exclude every virtual / pseudo / FUSE mountpoint on BOTH machines.
# --one-file-system stops the SENDER crossing its own mounts, but --delete runs
# on the RECEIVER: a mount the standby has, that the source lacks, would otherwise
# look "extraneous" and be deleted THROUGH -- into another machine entirely.
findmnt -rno TARGET,FSTYPE                 >  mounts.tmp
ssh "$STANDBY" findmnt -rno TARGET,FSTYPE  >> mounts.tmp
awk '$2 ~ /^(proc|sysfs|tmpfs|devtmpfs|cgroup2?|devpts|mqueue)$/ || $2 ~ /^fuse/ \
     { print $1 }' mounts.tmp | sort -u > mount-excludes

# --max-delete: a runaway delete alarms, never wipes silently.
# host-excludes lists the host-specific paths: /etc/fstab, /etc/hostname,
# /boot/efi, /var/log/journal, /etc/machine-id, ...
# (Comments cannot follow a trailing backslash, so they live up here.)
rc=0
rsync -a --delete --one-file-system \
      --max-delete=5000 \
      --exclude-from=host-excludes \
      --exclude-from=mount-excludes \
      /  "$STANDBY":/  || rc=$?
# exit 24 (a file vanished mid-transfer) is harmless; anything else is a failure.

Lesson. The volatile and the host-specific don’t belong in a clone. Exclude them deliberately, and remember that an exclude list is also a protection list under --delete.
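The exit-code rule in the comment above can be made explicit. The sketch below is one possible classification, not the original script: it treats 24 as harmless, as stated, and additionally singles out 25, which rsync documents as "the --max-delete limit stopped deletions". Treating 25 as its own "alarm" class is an assumption added here; classify_rsync_rc is a hypothetical name.

```shell
# Map an rsync exit code to ok / alarm / fail.
classify_rsync_rc() {
    case "$1" in
        0)  echo ok ;;
        24) echo ok ;;      # source files vanished mid-transfer: expected on a live system
        25) echo alarm ;;   # --max-delete stopped deletions: something is deleting too much
        *)  echo fail ;;
    esac
}

# Possible use after the clone:
#   case "$(classify_rsync_rc "$rc")" in
#       ok)    : ;;
#       alarm) notify "clone hit --max-delete" "rsync exit $rc" ;;
#       fail)  notify "clone failed" "rsync exit $rc" ;;
#   esac
```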

3. The boot — where “clone a machine” really breaks

This is the part I had underestimated most badly. Of course you cannot copy one machine’s boot configuration onto different hardware and expect it to boot. I knew that and thought I had it handled; it turned out to be nastier than I expected. The filesystem table references disks by the source’s UUIDs, which don’t exist on the target. The kernel’s initramfs is built for the source’s storage controller; mine was built for an NVMe root while the standby boots from SATA, so the standby couldn’t find its own disk and hung in the initramfs with “device does not exist.” The bootloader configuration likewise points at the source’s root.

The right model is a split: the data clones freely; the boot stack is host-specific and must either be excluded so the target keeps its own, or rebuilt on the target after each clone. I did both — excluding the identity bits, and rebuilding the standby’s boot for its own hardware on every wake:

# After the clone, rebuild THIS machine's boot for ITS OWN hardware.
# Path-based only -- never a /dev/sdX device, because disk names differ per machine.
update-initramfs -u -k all                 # initrd carrying the standby's own drivers
grub-install --target=x86_64-efi --efi-directory=/boot/efi --bootloader-id=ubuntu
update-grub
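The filesystem-table half of the problem can be checked mechanically after each clone. The sketch below lists every UUID that an fstab references but that does not exist on the local machine. It assumes the common unquoted UUID=... form and the standard /dev/disk/by-uuid directory; missing_fstab_uuids is a name invented for this example, and both paths can be overridden so the function can be tried on test files.

```shell
# Print each UUID referenced in the fstab that this machine does not have.
# Returns 0 if all are present, 1 if any is missing.
missing_fstab_uuids() {
    local fstab="${1:-/etc/fstab}" byuuid="${2:-/dev/disk/by-uuid}"
    local spec uuid missing=0
    # "|| [ -n ... ]" also handles a last line without a trailing newline.
    while read -r spec _ || [ -n "$spec" ]; do
        case "$spec" in
            UUID=*) uuid="${spec#UUID=}" ;;
            *)      continue ;;          # comments, blank lines, non-UUID entries
        esac
        if [ ! -e "$byuuid/$uuid" ]; then
            echo "$uuid"
            missing=1
        fi
    done < "$fstab"
    return "$missing"
}
```

If /etc/fstab is properly excluded from the clone this should never print anything, which is exactly why it is worth running: it fails loudly the day the exclusion silently stops working.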

And then the genuinely humbling one. The standby kept booting with the source’s root UUID, and I spent hours grepping every bootloader config file for that UUID. Nothing. It wasn’t in any config file. It was baked into the bootloader’s EFI core image — a binary — by an install that had once run while the root was misidentified. Compounding it, the source was a UEFI machine still carrying BIOS-flavoured bootloader packages, so the clone kept dragging a boot stack that couldn’t even install correctly on the UEFI/GPT standby. The recovery tool I leaned on “fixed” it every time, which neatly hid that the next clone re-broke it.

Lesson. The bad value is not always where it’s legible. I was reasoning about config files; the truth was in a binary. The moment I stopped grepping configs and searched the whole boot tree for the string — binaries included — it fell out in one line.
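That whole-tree search is short enough to keep as a helper. A sketch, with the hypothetical name find_uuid_in_tree: it lists every file under a directory that contains the given string, and because it asks only for file names, grep reports binary files alongside text files.

```shell
# List every file under a tree that contains the given UUID, binaries included.
# Returns non-zero if nothing matches.
find_uuid_in_tree() {
    local uuid="$1" tree="${2:-/boot}"
    # -r recurse, -l print matching file names only (binary files are listed
    # too), -F treat the UUID as a fixed string rather than a regex.
    grep -rlF -- "$uuid" "$tree"
}

# Possible use on the standby, with $SOURCE_UUID being the main server's root UUID:
#   find_uuid_in_tree "$SOURCE_UUID" /boot && echo "source UUID still present in /boot"
```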

4. When the patient reports its own death

Here’s a subtle one. When the thing you’re cloning is also the thing that reports errors, a failed clone goes silent. The clone overwrites the standby’s own mail configuration mid-run, and then the standby powers itself off before any alert is delivered. I changed the destination address and still got nothing — because the message never left the box. The fix is to write anything that must survive the clone — alerts, health logs — somewhere the clone isn’t clobbering. I routed both through the always-on peer:

# This machine's own mail is overwritten by the clone (and it may power off before
# delivery), so send alerts through an intact PEER over SSH. Fall back to local
# mail only if the peer is unreachable.
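notify() {
    # $1 = subject (must not contain a single quote), $2 = body (defaults to the subject)
    local subject="$1" body="${2:-$1}"
    printf '%s\n' "$body" \
        | ssh "$PEER" "mail -s '$subject' '$ADMIN'" 2>/dev/null \
      || printf '%s\n' "$body" | mail -s "$subject" "$ADMIN" 2>/dev/null \
      || true
}

Lesson. Observability for a self-overwriting machine has to live off that machine.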
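Routing alerts through the peer still depends on the standby getting far enough to send one. A complementary idea, offered here as an assumption and not as part of the original setup, is a heartbeat: the standby writes a timestamp to the peer at the end of each good run, and the peer raises the alarm when that timestamp grows too old. The file path and the name heartbeat_is_stale are hypothetical.

```shell
# On the standby, at the end of a successful run (hypothetical path on the peer):
#   ssh "$PEER" 'date +%s > /var/lib/standby/heartbeat'

# On the peer, from cron: succeed (return 0) if the heartbeat is older than
# max_age seconds, missing, or unreadable. Silence counts as failure.
heartbeat_is_stale() {
    local file="$1" max_age="$2" now last
    now="$(date +%s)"
    last="$(cat "$file" 2>/dev/null)" || return 0    # no heartbeat at all
    case "$last" in
        ''|*[!0-9]*) return 0 ;;                     # garbage instead of a timestamp
    esac
    [ $(( now - last )) -gt "$max_age" ]
}

# Possible cron use on the peer, allowing a little over one day:
#   heartbeat_is_stale /var/lib/standby/heartbeat 90000 && echo "standby silent" | mail -s "standby stale" "$ADMIN"
```

The point of the design is that the alert is triggered by the absence of good news, so it still fires when the standby dies before it can say anything.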

5. Automation that fails safe

The thing that turned a bad clone into a bricked standby was that coltrane powered itself off afterward — into a boot it had never verified. So it slept and woke unbootable. Two rules closed that. First: never power off into an unverified boot.

# The safety gate: never power off into a boot we have not verified.
# An empty UUID would match every line of grub.cfg, so treat it as a failure.
own_uuid="$(findmnt -no UUID /)"           # this machine's real root UUID
if [ -n "$own_uuid" ] && grep -qF -- "$own_uuid" /boot/grub/grub.cfg; then
    poweroff                               # verified: grub.cfg points at our own root
else
    # stay reachable; do NOT sleep broken
    notify "boot re-fit FAILED" "grub.cfg does not reference our own root; staying up"
fi

Second: a clone error must not skip the boot rebuild. By the time a clone can fail, it has already overwritten /boot, so the standby has to re-fit regardless of the clone’s exit code and let the gate above judge the result.

# A clone error must not skip the re-fit: /boot is already overwritten by the time
# the clone can fail, so the standby MUST re-fit or it wakes unbootable.
clone_rc=0
run_clone || clone_rc=$?
(( clone_rc != 0 )) && notify "clone exited $clone_rc" "re-fitting anyway"
refit_boot          # always runs; the gate decides whether the result is safe

Lesson. A standby left running is a nuisance; a standby powered off into a broken boot is a site visit. Bias every automatic decision toward staying recoverable.
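Since a single grub.cfg check cannot see a UUID baked into a binary, the gate may be worth generalizing into a list of checks that must all pass. The sketch below is one way to do that, not the original script: boot_gate is a hypothetical name, and the check functions passed to it are placeholders for whatever tests apply to your machine.

```shell
# Run every named check function; report each failure; succeed only if all pass.
# All checks run even after one fails, so a single alert can name every problem.
boot_gate() {
    local check failed=0
    for check in "$@"; do
        if ! "$check"; then
            echo "gate: $check FAILED" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Possible use, with hypothetical check functions:
#   if boot_gate check_grub_cfg check_fstab check_no_source_uuid; then
#       poweroff
#   else
#       notify "boot gate FAILED" "staying up for inspection"
#   fi
```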

The principles, distilled

  1. Host-identity, boot, and volatile data are excluded or re-fitted — never blindly cloned. Filesystem tables, bootloaders, kernels/initramfs, machine IDs, logs, journals.
  2. Stateful services need consistent, application-aware capture — dumps and maintenance-mode, not live-file copies — and their parts (data, database, config) must agree.
  3. Observability and alerting must live somewhere the clone can’t overwrite. A peer, not the patient.
  4. Automation must fail safe. Never sleep into an unverified state; never let one step’s failure silently skip a critical later step.
  5. Disaster Recovery is not High Availability. Know your recovery-point and recovery-time objectives, accept the human in the loop, and remember that failback is harder than failover and an untested plan is only a hypothesis. Rehearse it.

The real lesson: cycles, not steps

The clone “worked” after the design step. Everything that mattered — the BIOS/UEFI mismatch, the UUID hidden in a binary, the journal mutating mid-copy, the alerts dying with the patient — only existed once it had to run. The understanding was produced by the cycles of break-and-look, not specified in advance. I don’t believe anyone could have designed their way to this list up front; I certainly couldn’t, and I think about distributed systems for a living. Operational reality is discovered, not derived.

A recurring sub-lesson: confident, plausible explanations were wrong, repeatedly. The trap each time was reasoning about the system instead of looking at it. The breakthroughs came from inspecting the actual artifact — cat the real file, grep the real tree, read the real boot log — which beat ten elegant theories every time.

A note on the AI collaboration

I worked through much of this with a Claude AI assistant, and I want to be explicit about that, including its shape. It was useful for breadth — surfacing the consistency issue, the boot-identity split, the fail-safe gate, the route-alerts-through-a-peer idea — and for turning each finding into careful, commented scripts. It was also confidently wrong several times: it proposed a BIOS-boot-partition fix for what was a UEFI machine, guessed the bad UUID’s location incorrectly more than once, and floated a permissions theory that a reboot disproved. What made the collaboration work was not the model being right; it was the loop — me running things on the actual machines, pasting back the actual output, and both of us treating the artifact as the authority over the argument. That division of labour — the model for breadth and drafting, the human for ground truth and judgment — is, I think, the honest picture of where these tools help today.

Host names and identifying details have been changed, and the scripts stripped to their teaching essentials. The bugs were real.