Single-Node Kubernetes Disaster Recovery: Backups That Survive a Wiped Docker VM

A single-node minikube cluster on Docker Desktop runs the entire control plane, kubelet, every PVC, every Secret, and the container image cache inside one VM whose disk is one file: ~/Library/Containers/com.docker.docker/Data/vms/0/data/Docker.raw on macOS. When that file is lost or corrupted, every piece of cluster state goes with it in a single event. There is no “node failure vs storage failure” distinction to design around, so any backup strategy that assumes the two are separable does not apply.

This article is the single-node companion to Kubernetes Disaster Recovery, which assumes multi-node, etcd-on-disk, and an off-cluster object store. None of those assumptions hold here. For host-level setup that creates this failure domain, see Kubernetes on Apple Silicon Setup Gotchas.

The failure domain that breaks every “in-cluster” backup tool

The single-VM substrate has three consequences that constrain every choice downstream.

etcd snapshots stored on a hostPath PV are inside the failure domain. They are a slightly older copy of the state being recovered, stored on the same disk, and they die with the cluster.

Velero backed by an in-cluster MinIO bucket (the usual quick-start setup) is also inside the failure domain. To do anything useful, Velero needs a remote bucket (S3, B2, GCS), at which point there is an off-cluster dependency, an IAM key on disk, a recurring cost line, and a Helm chart plus CRDs plus a controller pod, all to back up a homelab. Velero is the right tool for multi-node clusters where node failure and storage failure are independent. On a single node, the cost-benefit shifts.

Every PVC, every Secret, every ConfigMap, every container image lives in the same Docker.raw file. A backup strategy that captures only one class is a partial backup. The honest framing: pick what’s worth backing up out of the VM, accept that everything else rebuilds from empty.

Design decisions

Back up source repos, not PVCs

Source is small (around 26 MB per day for a 20-repo set), trivially restorable, and contains the intent of every service. PVC contents — Postgres state, message history, Mattermost uploads — are large, change constantly, and require app-aware dump tooling per service. Accepting “rebuild Postgres and Mattermost from empty on restore” is a defensible posture for a single-node lab cluster as long as it’s explicit.

A self-hosted Git forge (Gitea, Forgejo) running on the cluster is itself in the failure domain. Back it up as repos, not as a PVC: the repos are the recoverable artifact; the Gitea database is auth state, webhook secrets, and per-user metadata that’s faster to recreate than to restore.

External drive as primary, cloud as optional second tier

A USB or Thunderbolt drive survives Docker Desktop wipes, host OS reinstalls, and Docker corruption events. It has no recurring cost and no credentials to leak. The drive itself is a single point of failure — mitigate either by rotating two drives weekly or by adding restic/rclone to off-site object storage as a second tier. Cloud-as-primary is the wrong default for a homelab: if the nightly backup costs money, it gets cancelled within a quarter; if it lives on an external drive on the desk, it survives the next reorg.

Mirror clone, not PVC tarball

git clone --mirror is a verifiable byte-for-byte copy of every ref — branches, tags, PR refs, notes. Git’s own integrity model does the verification. Restore is git push --mirror — a Git primitive, not a tool-specific import. A tar of the Gitea data PVC is fragile across forge versions, leaks auth tokens and webhook secrets into the backup blast radius, and locks the restore target to the same forge. A mirror clone restores cleanly to a fresh Gitea, to Forgejo, or to GitHub.

Seven-day daily retention, not GFS

Daily snapshots with seven-day retention catch “a secret was committed three days ago and force-pushed over since” recovery scenarios. Beyond a week the Git history itself is the backup; older snapshots are mostly identical and waste drive space.

Cap Docker Desktop memory explicitly

Docker Desktop auto-allocates around 60% of host RAM. Under workload on a 64 GiB host, that pushes macOS into jetsam territory and com.docker.backend is the highest-memory process — macOS kills it. Repeated SIGKILL of the backend is what corrupts the Data folder. The DR event is preventable. Cap at roughly 38% of host RAM (24 GiB on 64 GiB) via Docker Desktop → Settings → Resources, or via ~/Library/Group Containers/group.com.docker/settings-store.json with MemoryMiB: 24576. Memory cap is the actual fix; renaming the Data folder is a red herring.
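One way to confirm the cap took effect is to read the value back from the settings file. The sketch below assumes the settings-store.json file is JSON with a top-level MemoryMiB key, as described above, and that python3 is installed. The function names docker_mem_mib and mem_cap_mib are made up for illustration.

```shell
# Print the MemoryMiB value from a Docker Desktop settings file.
# Assumption: the file is JSON with a top-level "MemoryMiB" key.
docker_mem_mib() {
  local settings=$1
  python3 - "$settings" <<'PY'
import json, sys
with open(sys.argv[1]) as fh:
    print(json.load(fh).get("MemoryMiB", "unset"))
PY
}

# Cap in MiB for a given host RAM (GiB) and percentage,
# rounded down to a whole number of GiB.
mem_cap_mib() {
  local host_gib=$1 percent=$2
  echo $(( host_gib * percent / 100 * 1024 ))
}
```

Calling mem_cap_mib 64 38 gives the 24576 used above, and docker_mem_mib "$HOME/Library/Group Containers/group.com.docker/settings-store.json" should print the same number once the cap is saved.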

What the backup script does

The mechanism, step by step. Each step exists to defeat a specific failure mode that bites cron-scheduled backups on macOS.

set -euo pipefail. Fail fast. No silent partial backups.
Cron-safe environment. Explicit absolute paths to kubectl, git, tar, curl, python3, plus an explicit PATH=/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin. Cron’s stripped PATH is the single most common reason nightly backups silently no-op.
Self-bootstrapping port-forward. Ping the service. If unreachable, run kubectl port-forward in the background, capture the PID, register a trap cleanup EXIT to kill it on exit. Removes the dependency on a separate make port-forward shell the operator has to remember to leave running.
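Enumerate via API, not a hardcoded list. For Gitea: GET /api/v1/repos/search?owner=$OWNER&limit=50. New repos get backed up automatically the next night. Hardcoded lists drift. The limit=50 is a single page: an owner with more than 50 repos needs pagination, or the overflow is silently skipped.
Mirror clone, not regular clone. git clone --mirror preserves every ref. A plain git clone creates a local branch only for the default branch, keeps the others as remote-tracking refs, and does not fetch PR refs or notes, so a restore pushed from it silently loses PR refs, notes, and every branch that was never checked out locally.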
Per-repo tarball. tar -czf <DEST>/<YYYY-MM-DD>/<repo>.tgz. Per-repo (not one big tarball) so a single repo restores without extracting the rest, and a single corrupt repo doesn’t poison the whole snapshot.
MANIFEST.txt per day. Tab-separated <repo>\t<size_bytes>\t<sha256>\t<HEAD_ref>\t<HEAD_commit>. Verifies integrity without extracting (shasum -a 256), confirms the right HEAD on restore, detects bit-rot on the backup drive.
Day-directory rotation. Loop over $DEST_ROOT/2*, compute each directory's age from its mtime (stat -f %m on BSD/macOS, stat -c %Y on GNU), and rm -rf it if older than RETENTION_DAYS. Portable between macOS and Linux operators.
Logs separate from snapshots. <DEST>/logs/backup-<date>.log, retained 30 days. When the script silently fails in cron context, the log directory is the first place to look — and logs need to outlive the snapshots.
Non-zero exit on any failure. exit 4 if any single repo failed. Cron surfaces this in the local mail spool or stderr capture, so the operator notices.

A redacted, templated form of the script:

#!/usr/bin/env bash
# Nightly mirror-clone backup of every repo owned by $OWNER on a self-hosted
# Gitea instance, to <DEST_ROOT>/<YYYY-MM-DD>/<repo>.tgz with a sha256 manifest.
set -euo pipefail

# --- absolute paths (cron has a stripped PATH) -----------------------------
export PATH=/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
KUBECTL=/opt/homebrew/bin/kubectl
GIT=/usr/bin/git
TAR=/usr/bin/tar
CURL=/usr/bin/curl
PY=/usr/bin/python3

# --- config (override via env) ---------------------------------------------
OWNER="${OWNER:-<owner>}"
GITEA_USER="${GITEA_USER:-<admin-user>}"
GITEA_PASS="${GITEA_PASS:-<admin-pass>}"
GITEA_HOST="${GITEA_HOST:-localhost:3000}"
GITEA_SVC_NS="${GITEA_SVC_NS:-<gitea-namespace>}"
GITEA_SVC="${GITEA_SVC:-<gitea-svc>}"
DEST_ROOT="${DEST_ROOT:-/Volumes/<your-backup-drive>/gitea-backups}"
RETENTION_DAYS="${RETENTION_DAYS:-7}"
LOG_RETENTION_DAYS="${LOG_RETENTION_DAYS:-30}"

DATE=$(date +%Y-%m-%d)
DEST="$DEST_ROOT/$DATE"
LOG_DIR="$DEST_ROOT/logs"
LOG_FILE="$LOG_DIR/backup-$DATE.log"
mkdir -p "$DEST" "$LOG_DIR"
exec > >(tee -a "$LOG_FILE") 2>&1

# --- self-bootstrap port-forward if needed ---------------------------------
PF_PID=""
cleanup() { [[ -n "$PF_PID" ]] && kill "$PF_PID" 2>/dev/null || true; }
trap cleanup EXIT

if ! "$CURL" -sf "http://$GITEA_HOST/api/v1/version" >/dev/null 2>&1; then
  "$KUBECTL" -n "$GITEA_SVC_NS" port-forward "svc/$GITEA_SVC" 3000:3000 \
    >>"$LOG_FILE" 2>&1 &
  PF_PID=$!
  sleep 3
fi

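# --- enumerate repos via API ------------------------------------------------
echo "--- listing repos owned by $OWNER ---"
# Fetch first so set -e aborts on an API failure; a failure inside a process
# substitution would otherwise go unnoticed.
repos_json=$("$CURL" -sf -u "$GITEA_USER:$GITEA_PASS" \
  "http://$GITEA_HOST/api/v1/repos/search?owner=$OWNER&limit=50")
# Use a while-read loop instead of mapfile: under cron, "env bash" resolves to
# the macOS system bash 3.2, which has no mapfile builtin.
REPOS=()
while IFS= read -r name; do
  REPOS+=("$name")
done < <(printf '%s' "$repos_json" | "$PY" -c 'import json,sys
for r in json.load(sys.stdin)["data"]: print(r["name"])')
echo "found ${#REPOS[@]} repos"
# An empty list means a broken query, not a clean backup (exit code 3 is arbitrary).
(( ${#REPOS[@]} > 0 )) || { echo "no repos returned"; exit 3; }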

MANIFEST="$DEST/MANIFEST.txt"
: > "$MANIFEST"
FAIL=0
TOTAL=0

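for repo in "${REPOS[@]}"; do
  url="http://$GITEA_USER:$GITEA_PASS@$GITEA_HOST/$OWNER/$repo.git"
  workdir=$(mktemp -d)
  if ! "$GIT" clone --mirror -q "$url" "$workdir/$repo.git" 2>>"$LOG_FILE"; then
    echo "  $repo  FAIL  clone"; FAIL=$((FAIL+1)); rm -rf "$workdir"; continue
  fi
  # git stores the clone URL, password included, in the mirror's config.
  # Rewrite it so credentials do not end up inside the tarball.
  "$GIT" -C "$workdir/$repo.git" remote set-url origin \
    "http://$GITEA_HOST/$OWNER/$repo.git"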
  head_ref=$("$GIT" -C "$workdir/$repo.git" symbolic-ref HEAD 2>/dev/null || echo "-")
  head_sha=$("$GIT" -C "$workdir/$repo.git" rev-parse HEAD 2>/dev/null || echo "-")
  tarball="$DEST/$repo.tgz"
  ( cd "$workdir" && "$TAR" -czf "$tarball" "$repo.git" )
  # GNU form first: on GNU stat, -f means "file system" and prints junk to stdout.
  size=$(stat -c %s "$tarball" 2>/dev/null || stat -f %z "$tarball")
  sha=$(shasum -a 256 "$tarball" | awk '{print $1}')
  printf "%s\t%s\t%s\t%s\t%s\n" "$repo" "$size" "$sha" "$head_ref" "$head_sha" \
    >> "$MANIFEST"
  TOTAL=$((TOTAL + size))
  printf "  %-40s OK   %s  %s\n" "$repo" \
    "$(printf '%d' "$size" | awk '{printf "%.1fMB", $1/1024/1024}')" \
    "${head_sha:0:8}"
  rm -rf "$workdir"
done

# --- prune old day-dirs -----------------------------------------------------
echo "--- pruning day-dirs older than $RETENTION_DAYS days ---"
now=$(date +%s)
for d in "$DEST_ROOT"/2*; do
  [[ -d "$d" ]] || continue
  # GNU form first; BSD/macOS stat rejects -c and falls through to -f.
  mtime=$(stat -c %Y "$d" 2>/dev/null || stat -f %m "$d")
  age_days=$(( (now - mtime) / 86400 ))
  if (( age_days > RETENTION_DAYS )); then
    echo "  prune $d (age ${age_days}d)"
    rm -rf "$d"
  fi
done

# --- summary ----------------------------------------------------------------
echo "DONE $DATE  ok:$((${#REPOS[@]} - FAIL))  fail:$FAIL  size:$((TOTAL/1024/1024))MB"
(( FAIL > 0 )) && exit 4
exit 0

Cron entry (operator’s user crontab):

30 2 * * * /path/to/bootstrap/scripts/backup-gitea-repos.sh

A run takes around 4 seconds of wall-clock time on a 20-repo set and writes around 26 MB per day.
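The enumeration call in the script asks for a single page of 50 results, so an owner with more repos than that would have the remainder skipped without an error. A hedged sketch of paging through the results follows. It assumes the search endpoint accepts a 1-based page parameter and returns an empty data list past the last page. The function names list_all_repos and gitea_page are illustrative and not part of the original script.

```shell
# Call a page-fetching function with page=1,2,3... until it prints nothing.
# $1 is the name of a function or command that takes a page number and
# prints one repo name per line.
list_all_repos() {
  local fetch_page=$1 page=1 names
  while :; do
    names=$("$fetch_page" "$page") || return 1
    [ -n "$names" ] || break
    printf '%s\n' "$names"
    page=$((page + 1))
  done
}

# One page from Gitea. Assumes GITEA_USER, GITEA_PASS, GITEA_HOST and OWNER
# are set as in the backup script, and that "page" is 1-based.
gitea_page() {
  curl -sf -u "$GITEA_USER:$GITEA_PASS" \
    "http://$GITEA_HOST/api/v1/repos/search?owner=$OWNER&limit=50&page=$1" \
    | python3 -c 'import json,sys
for r in json.load(sys.stdin)["data"]: print(r["name"])'
}
```

With both functions defined, list_all_repos gitea_page prints every repo name across pages. Passing the fetcher by name keeps the loop testable without a running forge.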

macOS Full Disk Access: the silent failure mode

macOS Sequoia and several prior releases sandbox cron from /Volumes/* by default. The first nightly run silently fails with no log file produced — the tee -a "$LOG_FILE" itself can’t write. The absence of a log file is the diagnostic.

Fix: System Settings → Privacy & Security → Full Disk Access → + → /usr/sbin/cron (use Cmd+Shift+G in the Finder picker to type the path directly). If managed-device policy blocks Full Disk Access, convert the cron entry to a ~/Library/LaunchAgents/*.plist launchd job — permission prompts are then user-interactive instead of silent.
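For the launchd route, a LaunchAgent that mirrors the 02:30 cron schedule might look like the sketch below. The label local.backup-gitea-repos, the helper name write_launch_agent, and the /tmp output paths are placeholders chosen for illustration, not values from the original setup.

```shell
# Write a LaunchAgent plist that runs a script every day at 02:30.
# $1 = path of the plist to write, $2 = absolute path of the backup script.
write_launch_agent() {
  local plist=$1 script=$2
  cat > "$plist" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>local.backup-gitea-repos</string>
  <key>ProgramArguments</key>
  <array><string>$script</string></array>
  <key>StartCalendarInterval</key>
  <dict><key>Hour</key><integer>2</integer><key>Minute</key><integer>30</integer></dict>
  <key>StandardOutPath</key><string>/tmp/backup-gitea-repos.out</string>
  <key>StandardErrorPath</key><string>/tmp/backup-gitea-repos.err</string>
</dict>
</plist>
EOF
}
```

After writing the file into ~/Library/LaunchAgents/, loading it with launchctl load followed by the plist path registers the job for the logged-in user.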

Restore procedure

The inverse of the backup. Three primitives.

# 1. Extract tarball — yields a bare <repo>.git directory
cd /tmp && tar -xzf /Volumes/<your-backup-drive>/gitea-backups/2026-05-05/<repo>.tgz

# 2. Verify HEAD against MANIFEST.txt
cd /tmp/<repo>.git && git rev-parse HEAD
# Compare to column 5 of the matching MANIFEST line

# 3. Push to a fresh empty remote
git push --mirror http://<admin-user>:<admin-pass>@<host>/<owner>/<repo>.git

git push --mirror is the inverse of git clone --mirror. It pushes every ref. Without --mirror, the restored repo silently lacks tags and PR refs and the loss is only noticed weeks later.
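Step 2 above is a manual comparison. A small helper can do the lookup, assuming the five-column, tab-separated MANIFEST.txt layout described earlier. The names manifest_field and verify_head are hypothetical.

```shell
# Print one column of the MANIFEST.txt line for a repo.
# Columns: 1 repo, 2 size_bytes, 3 sha256, 4 HEAD_ref, 5 HEAD_commit.
manifest_field() {
  local manifest=$1 repo=$2 col=$3
  awk -F '\t' -v r="$repo" -v c="$col" '$1 == r { print $c }' "$manifest"
}

# Compare the extracted bare repo's HEAD with the commit recorded at backup time.
verify_head() {
  local manifest=$1 repo=$2 repo_dir=$3 want got
  want=$(manifest_field "$manifest" "$repo" 5)
  got=$(git -C "$repo_dir" rev-parse HEAD)
  if [ -n "$want" ] && [ "$want" = "$got" ]; then
    echo "OK $repo $got"
  else
    echo "MISMATCH $repo manifest=$want repo=$got" >&2
    return 1
  fi
}
```

A call such as verify_head "$DAY_DIR/MANIFEST.txt" <repo> /tmp/<repo>.git, where DAY_DIR is the dated snapshot directory, is meant to run between extraction and the push, so a wrong or truncated tarball is caught before it reaches the new remote.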

Integrity verification without restore

shasum -a 256 /Volumes/<your-backup-drive>/gitea-backups/<date>/<repo>.tgz
# Compare to the third tab-separated column in MANIFEST.txt

Run after every backup-drive change — new drive, drive moved between machines, suspicious sounds. Catches bit-rot and bad-cable corruption before the backup is needed.
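Checking one tarball at a time gets tedious on a 20-repo snapshot. The sketch below walks a whole day directory, assuming the MANIFEST.txt layout described earlier (repo name in column 1, sha256 in column 3) and tarballs named <repo>.tgz next to it. The helper names sha256_of and verify_day are illustrative.

```shell
# sha256 of a file, using whichever tool is installed.
sha256_of() {
  if command -v shasum >/dev/null 2>&1; then
    shasum -a 256 "$1" | awk '{print $1}'
  else
    sha256sum "$1" | awk '{print $1}'
  fi
}

# Check every tarball listed in <day_dir>/MANIFEST.txt against column 3.
# Returns non-zero if any tarball is missing or does not match.
verify_day() {
  local day_dir=$1 bad=0 repo size sha rest actual
  while IFS="$(printf '\t')" read -r repo size sha rest; do
    actual=$(sha256_of "$day_dir/$repo.tgz" 2>/dev/null) || actual=""
    if [ "$actual" = "$sha" ]; then
      echo "OK   $repo"
    else
      echo "BAD  $repo"; bad=$((bad + 1))
    fi
  done < "$day_dir/MANIFEST.txt"
  [ "$bad" -eq 0 ]
}
```

Pointing verify_day at a dated snapshot directory prints one line per repo and exits non-zero on the first sign of bit-rot, which makes it easy to run from a shell loop over every retained day.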

When no script existed yet: recovering from ad-hoc local clones

When the disaster predates the backup script, the recovery procedure is different:

Find local checkouts. A code-indexer, a librarian agent, a developer-laptop checkout — anything that happens to have full clones from before the wipe. Check external drives first, before any destructive recovery dance.
For each repo: POST /api/v1/user/repos to create the empty target, then git push --all and git push --tags to the new remote.
Re-wire CI: webhooks, branch protection, deploy keys.
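The create-then-push step can be sketched as below. The sketch assumes the create endpoint accepts a JSON body with name and private fields and creates the repo under the authenticated user, and that GITEA_USER, GITEA_PASS, GITEA_HOST and OWNER are set by the caller. The function names are hypothetical.

```shell
# Build the JSON body for POST /api/v1/user/repos.
# Assumption: the API accepts "name" and "private" fields.
create_repo_payload() {
  python3 -c 'import json,sys; print(json.dumps({"name": sys.argv[1], "private": True}))' "$1"
}

# Create the empty target, then push branches and tags from a local checkout.
restore_from_checkout() {
  local repo=$1 checkout=$2
  local remote="http://$GITEA_USER:$GITEA_PASS@$GITEA_HOST/$OWNER/$repo.git"
  curl -sf -u "$GITEA_USER:$GITEA_PASS" -H 'Content-Type: application/json' \
    -d "$(create_repo_payload "$repo")" \
    "http://$GITEA_HOST/api/v1/user/repos" >/dev/null || return 1
  git -C "$checkout" push --all "$remote" && git -C "$checkout" push --tags "$remote"
}
```

Note that git push --all only pushes local branches. Branches that exist in the checkout only as remote-tracking refs need to be checked out first, or they stay behind.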

This procedure is what runs once. The scheduled script is what runs forever after.

A real incident: how 13 repos were nearly lost

The script in this article exists because of a specific event. The chronology is worth telling because it shows what actually goes wrong, in what order, and how long it takes to fix systemically.

Day -3, three days before the wipe. A cluster resource change required restarting Docker Desktop. The operator ran minikube delete to free the old VM and create a new one with adjusted memory. minikube delete is not a resource-change command; it is a data-destruction command that happens to free resources. Every PVC, every Secret, every event in the hub state, every Mattermost message, every backlog item: gone. The forensic trail was four lines:

🔥  Deleting "minikube" in docker ...
🔥  Deleting container "minikube" ...
🔥  Removing /Users/<user>/.minikube/machines/minikube ...
💀  Removed all traces of the "minikube" profile.

Lesson learned, no backup script written yet.

Day 0, the wipe itself. Docker Desktop’s default memory allocation on a 64 GiB host put it around 38 GiB. Under workload, macOS jetsam killed com.docker.backend repeatedly. The exact signature in ~/Library/Containers/com.docker.docker/Data/log/host/com.docker.backend.log:

agent-api: context cancelled
desktop state:ExitHealthyState
backend cancelled with error: <nil>
  at backend.go:560

After enough SIGKILLs, the Docker.raw file was corrupt. The Docker VM would not start. Renaming the Data folder (the support-forum advice) didn’t help — the underlying cause was memory pressure, not file-layout pathology. The cluster was gone: every PVC, every container image, every cached layer, every persistent service.

Day 0, the inventory. Around 13 repos were declared lost. During a rebuild it’s tempting to skip restoring repos that look deprecated or replaceable; the operator had assumed those repos were recoverable from a recent push and moved on. They weren’t. The remote was the cluster.

Day 0, the side-channel windfall. An indexer agent had cloned every repo to an external drive a few days earlier for unrelated reasons. The clones were complete, recent, and on a different physical disk. Recovery took an evening: create empty repos through the Gitea API, push every ref, re-wire CI hooks. Nothing was actually lost.

Day 9, the systemic fix. The backup script shown earlier landed nine days after the incident. Honest reporting: a backup that depends on an unrelated agent happening to have a recent local clone is not a backup; it is a coincidence that worked once.

Lessons that survived the incident:

minikube delete is not a resource-change command — it is a data-destruction command that happens to free resources.
Docker Desktop’s auto-allocated memory is the disaster you’re recovering from; capping it is cheaper than restoring from backup.
Recovery from a side-channel (a developer’s local clone) is not a backup strategy. It’s how you find out you needed one.

What the script does NOT back up

Be explicit about it, because the framing “complete DR posture” depends on knowing what’s accepted as loss.

Class	In script?	Recovery posture
Gitea repos (every ref, every tag)	Yes	`git push --mirror` from tarball
Gitea database (users, hooks, tokens)	No	Recreate from declarative config
PostgreSQL data (app state)	No	Rebuild from empty; add a `kubectl exec ... pg_dump` job later if needed
Mattermost messages, uploads	No	Accept as loss on a lab cluster
Container images	No	Rebuild from source
Secrets, ConfigMaps	No	Recreate from sealed-secrets manifests in the repo
etcd state	No	Rebuild on `kubectl apply` from manifests in the backed-up repos

The defensible posture for a single-node lab cluster is: back up the source of truth (repos), accept that everything derived from it rebuilds. If application state matters, add a second cron job that does kubectl exec <postgres-pod> -- pg_dump -U <user> <db> > $DEST/<date>/db.sql and lives next to the repo backups. The pattern is identical: external drive, manifest, retention prune.
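A minimal sketch of that second job follows, built around the pg_dump command above. The function name dump_db and the KUBECTL override are assumptions for illustration, and the pod, user, and database stay as arguments for the caller to supply.

```shell
# kubectl binary to use; override with an absolute path under cron.
KUBECTL="${KUBECTL:-kubectl}"

# Dump one database to <dest_root>/<YYYY-MM-DD>/db.sql.
dump_db() {
  local dest_root=$1 pod=$2 user=$3 db=$4
  local day_dir out
  day_dir="$dest_root/$(date +%Y-%m-%d)"
  out="$day_dir/db.sql"
  mkdir -p "$day_dir"
  # Write to a temporary name first so a failed dump never looks like a good one.
  if "$KUBECTL" exec "$pod" -- pg_dump -U "$user" "$db" > "$out.partial"; then
    mv "$out.partial" "$out"
    echo "OK $out"
  else
    rm -f "$out.partial"
    echo "FAIL pg_dump $db" >&2
    return 1
  fi
}
```

Writing to db.sql.partial and renaming on success means an interrupted dump cannot overwrite yesterday's intact file with a truncated one.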

Diagnostic signatures

A successful run lists each repo with OK <size> <short-sha>, prunes day-dirs older than retention, and ends with DONE <date> ok:N fail:0 size:NMB. Three common silent-failure modes:

Full Disk Access denied. No log file at all at the next-night path. The absence is the diagnostic.
Cron PATH stripped. Log file exists but contains kubectl: command not found or git: command not found. Fix is the absolute-path constants at the top of the script.
Port-forward race. Log shows curl: (7) Failed to connect to localhost port 3000. Increase the sleep 3 after the background port-forward to sleep 5, or add a retry loop polling /api/v1/version.
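The retry loop mentioned for the port-forward race could be as small as the sketch below. wait_until is a hypothetical helper name, and the attempt count and delay are arbitrary starting points.

```shell
# Run a probe command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_until() {
  local tries=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$tries" ]; do
    "$@" >/dev/null 2>&1 && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

In the backup script this would stand in for the fixed sleep, for example wait_until 10 1 "$CURL" -sf "http://$GITEA_HOST/api/v1/version", which returns as soon as the forwarded port answers instead of guessing how long it takes.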

Generalizing beyond minikube on macOS

The mechanism — mirror clone plus per-repo tarball plus sha256 manifest plus retention prune plus external disk — generalizes cleanly to any single-node Kubernetes setup hosting a Git forge. The failure modes do not.

On Linux single-node setups (k3s, kind, k0s, microk8s), the equivalent failure mode is /var/lib/docker filling the host disk, or the host disk itself dying. The script works the same; the macOS-specific caveats (Docker.raw, jetsam, Full Disk Access on /Volumes) drop out and are replaced by ext4/btrfs/zfs concerns and systemd-cron paths.

The single-node DR principle is forge-agnostic and OS-agnostic: back up out of the VM — out of whatever the substrate is — or there is no backup. Everything else is implementation detail.