{"id":1120,"date":"2026-06-07T21:51:35","date_gmt":"2026-06-07T19:51:35","guid":{"rendered":"https:\/\/www.distributed-systems.net\/?page_id=1120"},"modified":"2026-06-10T22:56:29","modified_gmt":"2026-06-10T20:56:29","slug":"ds-infra","status":"publish","type":"page","link":"https:\/\/www.distributed-systems.net\/index.php\/ds-infra\/","title":{"rendered":"A small distributed system"},"content":{"rendered":"\n<h1>For the curious: the system that hosts this Website<\/h1>\n<p>I&#8217;ll be placing information on the distributed system that is running this and other sites. The main reason for running my own stuff is to learn. So many things can, and will, go wrong. For many years, I&#8217;ve been advocating that we shouldn&#8217;t be <strong>programming<\/strong> distributed systems, but instead <strong>composing<\/strong> them. This means gluing pieces of software, hardware, and actual machines together into a system. That shifts the emphasis to the glue, which consists of scripts, cron jobs, and whatnot. It also requires being able to switch stuff off, and more importantly, back on. Remotely. I will gradually add material, hopefully inspiring others to follow similar steps and avoid the mistakes I made.<\/p>\n<p>With the advent of AI, I&#8217;ve been handing my scripts to Claude for assessment and improvement. It has led to real improvements and, to me, illustrates a way to use such powerful tools. One important thing to keep in mind: I have the idea and the overall solution; I feed Claude an initial script that does most of the work, and then we take it from there. It&#8217;s one of the few ways to make sure you keep understanding what&#8217;s going on.<\/p>\n<p>For starters, here&#8217;s a story of the most difficult part of the system: cloning a main server to a warm standby. 
Thanks to the interaction with Claude, the initial script turned into something truly useful.<\/p>\n<div id=\"clone-story\"><style>\n@import url('https:\/\/fonts.googleapis.com\/css2?family=JetBrains+Mono:wght@400;500&display=swap');\n#clone-story{\n  --ink:#22201c; --muted:#6f685e; --paper:#fbf9f5; --accent:#b3492d;\n  --rule:#e4ded3; --code-bg:#f3efe7;\n  padding:2.5rem 0 4rem;\n}\n#clone-story a{color:var(--accent); text-decoration:none;\n  border-bottom:1px solid rgba(179,73,45,.35);}\n#clone-story pre{background:var(--code-bg); border-left:3px solid var(--accent);\n  border-radius:0 6px 6px 0; padding:1rem 1.1rem; overflow-x:auto;\n  margin:1.2rem 0; font-size:.84rem; line-height:1.55;}\n#clone-story code{font-family:\"JetBrains Mono\",ui-monospace,monospace;}\n#clone-story p code,#clone-story li code{background:var(--code-bg);\n  padding:.08em .35em; border-radius:3px; font-size:.85em;}\n#clone-story pre code{background:none; padding:0; white-space:pre; color:#2c2a25;}\n#clone-story .lesson{background:#fff; border:1px solid var(--rule);\n  border-radius:8px; padding:.85rem 1.1rem; margin:1.3rem 0; font-size:1rem;}\n#clone-story .lesson b{color:var(--accent); }\n#clone-story .foot{margin-top:2.6rem; padding-top:1.2rem;\n  border-top:1px solid var(--rule); font-size:.92rem; color:var(--muted);\n  font-style:italic;}\n<\/style>\n<h2><code><span style=\"font-size: x-large;\">rsync -a<\/span><\/code> Is Not &#8220;Clone a Machine&#8221;<\/h2>\n<p><em>Notes from building a warm standby \u2014 and everything that only showed up once it had to run for real.<\/em><\/p>\n<h3>The premise<\/h3>\n<p>I keep a second server at a different location as a warm standby. The idea is simple, and on paper it takes two steps. <strong>Design:<\/strong> a main server (I&#8217;ll call it <code>zappa<\/code>) runs my services; a standby (<code>coltrane<\/code>) at another site is a faithful clone; if <code>zappa<\/code> dies I switch DNS to <code>coltrane<\/code>. 
<strong>Implementation:<\/strong> an <code>rsync<\/code> of the whole filesystem through a daily wake-up for <code>coltrane<\/code>, and a small always-on machine (<code>horton<\/code>) at the standby site to capture the fast-changing bits. Done.<\/p>\n<p>It was not done. What follows is the gap between that two-step picture and what it actually took \u2014 because the interesting part is precisely the things that no amount of up-front design surfaced, and that only experimentation did. (Host names are pseudonyms; I&#8217;ve stripped domains, addresses, ports, UUIDs, and the service inventory. More on why at the end. The code is sanitized and uses placeholders such as <code>$STANDBY<\/code> and <code>$PEER<\/code>.)<\/p>\n<h3>The shape of the thing<\/h3>\n<ul>\n<li><code>zappa<\/code> \u2014 the main server. Runs a website that generates downloadable copies of a book on request, a self-hosted cloud, mail, and the usual.<\/li>\n<li><code>coltrane<\/code> \u2014 the standby, at a second site. Powered <strong>off<\/strong> most of the day; it wakes once a day, pulls a full clone from <code>zappa<\/code>, and goes back to sleep.<\/li>\n<li><code>horton<\/code> \u2014 the peer, a small, always-on machine at the standby site. It captures the volatile data on a tight cadence (downloaded copies every few minutes, incoming mail, database dumps), because <code>coltrane<\/code> is asleep and can&#8217;t.<\/li>\n<li>Failover is <strong>manual<\/strong>: switch DNS, bring <code>coltrane<\/code> up as primary.<\/li>\n<\/ul>\n<p>This is disaster recovery, not high availability \u2014 a human in the loop, minutes-to-hours of downtime accepted in exchange for something I can fully understand and repair myself. That framing matters, and I&#8217;ll come back to it.<\/p>\n<h4><em>1. A full-filesystem copy is not a consistent copy<\/em><\/h4>\n<p>The naive <code>rsync -a --delete \/<\/code> of a running system copies a live database file-by-file while the database is writing to it. 
The result is a torn, possibly unrestorable copy. The fix is to capture <em>logical<\/em> state \u2014 a dump \u2014 and let it ride along with the file sync:<\/p>\n<pre><code># Capture each database as a consistent logical snapshot, not a byte copy of a\n# file the database is still writing to. --single-transaction gives a consistent\n# view of InnoDB tables with no locking and no downtime.\nset -o pipefail   # a failed mysqldump must fail the whole pipeline, not only gzip\nfor db in app_db cloud_db; do\n    tmp=\"$DUMPDIR\/$db.sql.gz.tmp\"\n    mysqldump --single-transaction --quick --routines --triggers \"$db\" \\\n        | gzip &gt; \"$tmp\" \\\n        &amp;&amp; mv \"$tmp\" \"$DUMPDIR\/$db.sql.gz\"   # atomic: a reader never sees a half-written dump\ndone<\/code><\/pre>\n<p>The self-hosted cloud drove this home because it isn&#8217;t &#8220;files.&#8221; It&#8217;s a data directory <strong>plus<\/strong> a database <strong>plus<\/strong> config, and they have to agree. Restoring an old database against newer files \u2014 or vice versa \u2014 produces dangling references the moment a client reconnects. Worse, a client that synced <em>past<\/em> the restored state can read the rolled-back server as &#8220;files were deleted&#8221; and propagate those deletions. The clean answer was a choice: capture the database and files together and consistently, or simply <strong>stall<\/strong> the cloud during failover (maintenance mode) so no client ever syncs against a stale server. A paused server is safe; a rolled-back one is not.<\/p>\n<div class=\"lesson\"><b>Lesson.<\/b> Stateful services need consistent, application-aware capture \u2014 not a byte copy of whatever happened to be on disk at 3\u00a0a.m.<\/div>\n<h4><em>2. Where <code><span style=\"font-size: large;\">--delete<\/span><\/code> meets reality<\/em><\/h4>\n<p>A mirror needs <code>--delete<\/code>, and <code>--delete<\/code> interacts with everything else in ways I kept rediscovering. 
Excluded files <em>protect<\/em> directories from deletion, so a directory that vanishes on the source but still holds excluded files can&#8217;t be emptied (&#8220;cannot delete non-empty directory&#8221;). Live files change mid-transfer: the systemd journal is written continuously, so <code>rsync<\/code> copies it, the checksum no longer matches, and it reports &#8220;failed verification \u2014 update discarded&#8221; \u2014 and deleting it doesn&#8217;t help because it&#8217;s regenerated instantly. And <code>--delete<\/code> will happily walk <em>into<\/em> a mount point the target has but the source lacks, deleting through it into another machine entirely.<\/p>\n<pre><code># Exclude every virtual \/ pseudo \/ FUSE mountpoint on BOTH machines.\n# --one-file-system stops the SENDER crossing its own mounts, but --delete runs\n# on the RECEIVER: a mount the standby has, that the source lacks, would otherwise\n# look \"extraneous\" and be deleted THROUGH -- into another machine entirely.\nfindmnt -rno TARGET,FSTYPE                 &gt;  mounts.tmp\nssh \"$STANDBY\" findmnt -rno TARGET,FSTYPE  &gt;&gt; mounts.tmp\nawk '$2 ~ \/^(proc|sysfs|tmpfs|devtmpfs|cgroup2?|devpts|mqueue)$\/ || $2 ~ \/^fuse\/ \\\n     { print $1 }' mounts.tmp | sort -u &gt; mount-excludes\n\n# Options sit in an array so each can carry a comment; a comment placed after a\n# trailing backslash would cut the command short.\nrsync_opts=(\n    -a --delete --one-file-system\n    --max-delete=5000                  # a runaway delete alarms, never wipes silently\n    --exclude-from=host-excludes       # \/etc\/fstab, \/etc\/hostname, \/boot\/efi,\n    --exclude-from=mount-excludes      #   \/var\/log\/journal, \/etc\/machine-id, ...\n)\nrc=0\nrsync \"${rsync_opts[@]}\" \/  \"$STANDBY\":\/  || rc=$?\n# exit 24 (a file vanished mid-transfer) is harmless; anything else is a failure.<\/code><\/pre>\n<div class=\"lesson\"><b>Lesson.<\/b> The volatile and the host-specific don&#8217;t belong in a clone. Exclude them deliberately \u2014 and remember an exclude list is also a <em>protection<\/em> list under <code>--delete<\/code>.<\/div>\n<h4><em>3. 
The boot \u2014 where &#8220;clone a machine&#8221; really breaks<\/em><\/h4>\n<p>This is the part I&#8217;d most badly underestimated. Of course you cannot copy one machine&#8217;s boot configuration onto different hardware and generally expect it to boot. I knew that and thought I had it handled. It turned out to be a nasty part. The filesystem table references disks by the <em>source&#8217;s<\/em> UUIDs, which don&#8217;t exist on the target. The kernel&#8217;s initramfs is built for the source&#8217;s storage controller; mine was built for an NVMe root while the standby boots from SATA, so the standby couldn&#8217;t find its own disk and hung in the initramfs with &#8220;device does not exist.&#8221; The bootloader configuration likewise points at the source&#8217;s root.<\/p>\n<p>The right model is a split: the <strong>data<\/strong> clones freely; the <strong>boot stack<\/strong> is host-specific and must either be excluded so the target keeps its own, or rebuilt on the target after each clone. I did both \u2014 excluding the identity bits, and rebuilding the standby&#8217;s boot for its own hardware on every wake:<\/p>\n<pre><code># After the clone, rebuild THIS machine's boot for ITS OWN hardware.\n# Path-based only -- never a \/dev\/sdX device, because disk names differ per machine.\nupdate-initramfs -u -k all                 # initrd carrying the standby's own drivers\ngrub-install --target=x86_64-efi --efi-directory=\/boot\/efi --bootloader-id=ubuntu\nupdate-grub<\/code><\/pre>\n<p>And then the genuinely humbling one. The standby kept booting with the <em>source&#8217;s<\/em> root UUID, and I spent hours grepping every bootloader config file for that UUID. Nothing. It wasn&#8217;t in any config file. It was baked into the bootloader&#8217;s <strong>EFI core image<\/strong> \u2014 a binary \u2014 by an install that had once run while the root was misidentified. 
Compounding it, the source was a UEFI machine still carrying <em>BIOS-flavoured<\/em> bootloader packages, so the clone kept dragging a boot stack that couldn&#8217;t even install correctly on the UEFI\/GPT standby. The recovery tool I leaned on &#8220;fixed&#8221; it every time, which neatly hid that the next clone re-broke it.<\/p>\n<div class=\"lesson\"><b>Lesson.<\/b> The bad value is not always where it&#8217;s <em>legible<\/em>. I was reasoning about config files; the truth was in a binary. The moment I stopped grepping configs and searched the whole boot tree for the string \u2014 binaries included \u2014 it fell out in one line.<\/div>\n<h4><em>4. When the patient reports its own death<\/em><\/h4>\n<p>Here&#8217;s a subtle one. When the thing you&#8217;re cloning is also the thing that reports errors, a failed clone goes <strong>silent<\/strong>. The clone overwrites the standby&#8217;s own mail configuration mid-run, and then the standby powers itself off before any alert is delivered. I changed the destination address and still got nothing \u2014 because the message never left the box. The fix is to write anything that must survive the clone \u2014 alerts, health logs \u2014 somewhere the clone isn&#8217;t clobbering. I routed both through the always-on peer:<\/p>\n<pre><code># This machine's own mail is overwritten by the clone (and it may power off before\n# delivery), so send alerts through an intact PEER over SSH. Fall back to local\n# mail only if the peer is unreachable.\nnotify() {\n    printf '%s\\n' \"$2\" \\\n        | ssh \"$PEER\" \"mail -s '$1' '$ADMIN'\" 2&gt;\/dev\/null \\\n      || printf '%s\\n' \"$2\" | mail -s \"$1\" \"$ADMIN\" 2&gt;\/dev\/null \\\n      || true\n}<\/code><\/pre>\n<div class=\"lesson\"><b>Lesson.<\/b> Observability for a self-overwriting machine has to live off that machine.<\/div>\n<h4><em>5. 
Automation that fails safe<\/em><\/h4>\n<p>The thing that turned a bad clone into a <em>bricked<\/em> standby was that <code>coltrane<\/code> powered itself off afterward \u2014 into a boot it had never verified. So it slept and woke unbootable. Two rules closed that. First: never power off into an unverified boot.<\/p>\n<pre><code># The safety gate: never power off into a boot we have not verified.\nown_uuid=\"$(findmnt -no UUID \/)\"           # this machine's real root UUID\n# An empty UUID would make grep match every line, so treat it as unverified.\nif [[ -n \"$own_uuid\" ]] &amp;&amp; grep -qF -- \"$own_uuid\" \/boot\/grub\/grub.cfg; then\n    poweroff                               # verified: grub.cfg points at our own root\nelse\n    notify \"boot re-fit FAILED\" \"staying up\"    # stay reachable; do NOT sleep broken\nfi<\/code><\/pre>\n<p>Second: a clone error must not skip the boot rebuild. By the time a clone can fail, it has already overwritten <code>\/boot<\/code>, so the standby has to re-fit <em>regardless<\/em> of the clone&#8217;s exit code and let the gate above judge the result.<\/p>\n<pre><code># A clone error must not skip the re-fit: \/boot is already overwritten by the time\n# the clone can fail, so the standby MUST re-fit or it wakes unbootable.\nclone_rc=0\nrun_clone || clone_rc=$?\n(( clone_rc != 0 )) &amp;&amp; notify \"clone exited $clone_rc\" \"re-fitting anyway\"\nrefit_boot          # always runs; the gate decides whether the result is safe<\/code><\/pre>\n<div class=\"lesson\"><b>Lesson.<\/b> A standby left running is a nuisance; a standby powered off into a broken boot is a site visit. 
Bias every automatic decision toward staying recoverable.<\/div>\n<h3>The principles, distilled<\/h3>\n<ol>\n<li><strong>Host-identity, boot, and volatile data are excluded or re-fitted \u2014 never blindly cloned.<\/strong> Filesystem tables, bootloaders, kernels\/initramfs, machine IDs, logs, journals.<\/li>\n<li><strong>Stateful services need consistent, application-aware capture<\/strong> \u2014 dumps and maintenance-mode, not live-file copies \u2014 and their parts (data, database, config) must agree.<\/li>\n<li><strong>Observability and alerting must live somewhere the clone can&#8217;t overwrite.<\/strong> A peer, not the patient.<\/li>\n<li><strong>Automation must fail safe.<\/strong> Never sleep into an unverified state; never let one step&#8217;s failure silently skip a critical later step.<\/li>\n<li><strong>Disaster Recovery is not High Availability.<\/strong> Know your <a href=\"https:\/\/en.wikipedia.org\/wiki\/IT_disaster_recovery#Recovery_Point_Objective\" target=\"blank\">recovery-point and recovery-time objectives<\/a>, accept the human in the loop, and remember that <em>failback<\/em> is harder than failover and an untested plan is only a hypothesis. Rehearse it.<\/li>\n<\/ol>\n<h3>The real lesson: cycles, not steps<\/h3>\n<p>The clone &#8220;worked&#8221; after the design step. Everything that mattered \u2014 the BIOS\/UEFI mismatch, the UUID hidden in a binary, the journal mutating mid-copy, the alerts dying with the patient \u2014 only existed once it had to run. The understanding was <em>produced<\/em> by the cycles of break-and-look, not specified in advance. I don&#8217;t believe anyone could have designed their way to this list up front; I certainly couldn&#8217;t, and I think about distributed systems for a living. Operational reality is discovered, not derived.<\/p>\n<p>A recurring sub-lesson: confident, plausible explanations were wrong, repeatedly. 
The trap each time was reasoning <em>about<\/em> the system instead of looking <em>at<\/em> it. The breakthroughs came from inspecting the actual artifact \u2014 <code>cat<\/code> the real file, <code>grep<\/code> the real tree, read the real boot log \u2014 which beat ten elegant theories every time.<\/p>\n<h3>A note on the AI collaboration<\/h3>\n<p>I worked through much of this with a Claude AI assistant, and I want to be explicit about that, including its shape. It was useful for breadth \u2014 surfacing the consistency issue, the boot-identity split, the fail-safe gate, the route-alerts-through-a-peer idea \u2014 and for turning each finding into careful, commented scripts. It was also confidently <strong>wrong<\/strong> several times: it proposed a BIOS-boot-partition fix for what was a UEFI machine, guessed the bad UUID&#8217;s location incorrectly more than once, and floated a permissions theory that a reboot disproved. What made the collaboration work was not the model being right; it was the loop \u2014 me running things on the actual machines, pasting back the actual output, and both of us treating the artifact as the authority over the argument. That division of labour \u2014 the model for breadth and drafting, the human for ground truth and judgment \u2014 is, I think, the honest picture of where these tools help today.<\/p>\n<p class=\"foot\">Host names and identifying details have been changed, and the scripts stripped to their teaching essentials. The bugs were real.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>For the curious: the system that hosts this Website I&#8217;ll be placing information on the distributed system that is running this and other sites. 
The main reason for running my <a class=\"more-link\" href=\"https:\/\/www.distributed-systems.net\/index.php\/ds-infra\/\">Continue Reading \u2192<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-1120","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/pages\/1120","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/comments?post=1120"}],"version-history":[{"count":44,"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/pages\/1120\/revisions"}],"predecessor-version":[{"id":1192,"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/pages\/1120\/revisions\/1192"}],"wp:attachment":[{"href":"https:\/\/www.distributed-systems.net\/index.php\/wp-json\/wp\/v2\/media?parent=1120"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}