Recovery Runbooks¶
Step-by-step procedures for disaster recovery scenarios.
Quick Recovery Commands¶
Fast Local Rollback (Ceph Snapshots - Seconds)¶
# List available snapshots for VM1111
rbd snap ls ceph-pool/vm-1111-disk-0
# Stop VM and rollback
qm stop 1111
rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103
qm start 1111
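To roll back to the most recent daily snapshot without typing the date by hand, a minimal sketch (assumes snapshot names embed the date and therefore sort lexicographically):
LATEST_SNAP=$(rbd snap ls ceph-pool/vm-1111-disk-0 | awk '/daily-/ {print $2}' | sort | tail -1)
qm stop 1111
rbd snap rollback "ceph-pool/vm-1111-disk-0@${LATEST_SNAP}"
qm start 1111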
Restore from Vzdump (Minutes)¶
# Find the latest backup
ls /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1
# Restore
qmrestore /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst 1111 \
--storage ceph-pool --force
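Note that qmrestore expects a single archive, so if more than one backup matches the glob, a small sketch that restores only the newest one (same paths as above):
LATEST=$(ls -t /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | head -1)
qmrestore "$LATEST" 1111 --storage ceph-pool --force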
Restore from France RBD Export (30-60 min)¶
# On px5 - decompress and import
zstd -d < /mnt/nvme-vmdata/dr-images/vm-1111-disk-0-*.raw.zst | \
qemu-img convert -f raw -O qcow2 /dev/stdin /tmp/vm-1111.qcow2
qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
--scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
qm start 1111
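Before creating the VM, it can be worth sanity-checking the converted image; a quick check (assumes qemu-img is available on px5, which it is if the convert step above ran):
qemu-img info /tmp/vm-1111.qcow2    # confirm format and virtual size look sane
qemu-img check /tmp/vm-1111.qcow2   # qcow2 consistency check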
Scenario: VM1111 (CharlieHub) Down¶
Impact: External access, domain routing, SSH bastion, authentication
Quick Fix (< 2 minutes)¶
- Check if VM is running:
  qm status 1111
  ha-manager status
- If stopped, HA should auto-start. Manual start (see the wait sketch after this list):
  qm start 1111
- If on wrong node, HA will migrate automatically. Force if needed:
  ha-manager migrate vm:1111 px1-silverstone
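After a manual start, it can help to wait until the VM actually reports running before testing anything else; a minimal sketch (the timeout values are illustrative):
for i in $(seq 1 30); do
  qm status 1111 | grep -q running && break
  sleep 5
done
qm status 1111   # expect "status: running"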
Rollback to Snapshot (< 5 minutes)¶
- Stop VM:
  qm stop 1111
- List and rollback:
  rbd snap ls ceph-pool/vm-1111-disk-0
  rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103
- Start:
  qm start 1111
Full Restore from Backup (15-30 minutes)¶
- Remove corrupted VM:
  qm stop 1111
  qm destroy 1111 --purge
- Restore from vzdump:
  qmrestore /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst 1111 \
    --storage ceph-pool
- Re-add to HA and start:
  ha-manager add vm:1111 --state started
- Verify services:
  ssh ubuntu@151.80.58.99 "docker ps"
  curl -s https://docs.charliehub.net
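A compact way to verify the web endpoints after a full restore; a sketch using the health URLs that appear elsewhere in this runbook (extend the list as needed):
for url in https://docs.charliehub.net https://domains.charliehub.net/health; do
  curl -fsS -o /dev/null --max-time 10 "$url" && echo "OK   $url" || echo "FAIL $url"
done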
Scenario: CT1112 (PostgreSQL) Down¶
Impact: All database-dependent services
Quick Fix¶
- Check status:
  pct status 1112
  ha-manager status
- If stopped:
  pct start 1112
- Verify database:
  pct exec 1112 -- pg_isready
  pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"
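PostgreSQL can take a moment to accept connections after the container starts; a small wait-loop sketch (timeout values are illustrative):
for i in $(seq 1 12); do
  pct exec 1112 -- pg_isready -q && { echo "PostgreSQL is ready"; break; }
  sleep 5
done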
Rollback to Snapshot¶
pct stop 1112
rbd snap ls ceph-pool/vm-1112-disk-0
rbd snap rollback ceph-pool/vm-1112-disk-0@daily-20260103
pct start 1112
Full Restore¶
pct stop 1112
pct destroy 1112 --purge
pct restore 1112 /mnt/backup-storage/dump/vzdump-lxc-1112-*.tar.zst \
--storage ceph-pool
ha-manager add ct:1112 --state started
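An optional sanity check after the restore, assuming the standard postgres superuser exists inside the container (no database names are assumed):
pct exec 1112 -- pg_isready
pct exec 1112 -- sudo -u postgres psql -c "\l"   # list databases to confirm data came back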
Scenario: CT1113 (IoT Platform) Down¶
Impact: MQTT, Home Assistant, IoT integrations
Quick Fix¶
pct status 1113
pct start 1113 # If stopped
Rollback¶
pct stop 1113
rbd snap rollback ceph-pool/vm-1113-disk-0@daily-20260103
pct start 1113
Verify Services¶
pct exec 1113 -- docker ps
pct exec 1113 -- mosquitto_pub -t test -m "ping"
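The publish above only confirms the broker accepts messages; for a full round trip, a sketch that subscribes and publishes in one shot. It assumes mosquitto_sub is installed in the container and the broker allows anonymous localhost clients:
pct exec 1113 -- sh -c 'mosquitto_sub -t test -C 1 -W 5 & sleep 1; mosquitto_pub -t test -m ping; wait'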
Scenario: Single Node Failure¶
Impact: Minimal - Ceph and Proxmox HA handle the failover automatically
What Happens Automatically¶
- Ceph detects OSD down, continues with 2 remaining copies
- Proxmox HA migrates VMs to healthy nodes
- Services resume in 1-2 minutes
Verification¶
# Check cluster
pvecm status
# Check Ceph (will show degraded until node returns)
ceph -s
# Check HA migrated VMs
ha-manager status
qm list
When Node Returns¶
# Ceph automatically rebalances
ceph -s # Watch recovery progress
# VMs stay on current nodes (no auto-migration back)
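To block until the cluster is healthy again instead of re-running ceph -s by hand, a simple polling sketch (the 30 s interval is arbitrary):
while ! ceph health | grep -q HEALTH_OK; do
  ceph health
  sleep 30
done
echo "Ceph is healthy"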
Scenario: Two Nodes Down (Quorum Lost)¶
Impact: Cluster read-only, no VM operations
Force Single-Node Operation¶
# On remaining node
pvecm expected 1
# Check Ceph (may need min_size adjustment)
ceph -s
# If only 1 OSD, temporarily allow degraded operation
ceph osd pool set ceph-pool min_size 1
Restore Quorum When Nodes Return¶
# Nodes rejoin automatically
pvecm status
# Reset expected votes
pvecm expected 3 # or 4 for full cluster
# Reset min_size
ceph osd pool set ceph-pool min_size 2
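It is worth confirming that all OSDs are back and that the pool setting actually changed; a quick check:
ceph osd stat                          # expect all OSDs "up" and "in"
ceph osd pool get ceph-pool min_size   # should report min_size: 2 after the reset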
Scenario: Complete UK Site Failure¶
Impact: All UK services down, DR activation required
Activate DR on px5 (France)¶
- Assess situation:
  # From px5
  ping REDACTED_IP          # Should fail if UK is down
  # From hub2, check WireGuard connectivity
  sudo wg show
- Restore from RBD exports (fastest):
  cd /mnt/nvme-vmdata/dr-images/
  # VM1111
  zstd -d < vm-1111-disk-0-*.raw.zst | \
    qemu-img convert -f raw -O qcow2 /dev/stdin /tmp/vm-1111.qcow2
  qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
    --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
  qm start 1111
- Or restore from vzdump:
  ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst | tail -1
  qmrestore /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst 1111 \
    --storage local-lvm
  qm start 1111
- Update DNS to point to the France site (see the propagation check after this list)
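Once DNS has been repointed, a quick propagation check from a public resolver; the hostname and resolver here are illustrative:
dig +short docs.charliehub.net @1.1.1.1   # should now return the France address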
Scenario: Ceph Cluster Degraded¶
Check Health¶
ceph health detail
ceph osd tree
ceph -s
Common Fixes¶
OSD Down:
# Restart OSD
ssh root@<node> "systemctl restart ceph-osd@<id>"
# Check status
ceph osd tree
Slow Recovery:
# Check recovery progress
ceph -s
# Speed up recovery (temporarily)
ceph tell 'osd.*' injectargs '--osd-max-backfills=4'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active=6'
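Remember to revert the injected values once recovery finishes. The sketch below assumes the stock defaults (1 and 3 on most releases; confirm for your Ceph version):
ceph tell 'osd.*' injectargs '--osd-max-backfills=1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active=3'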
PGs Stuck:
# Find stuck PGs
ceph pg dump_stuck
# Trigger a repair on the affected PG
ceph pg repair <pg_id>
Post-Recovery Checklist¶
# 1. Cluster quorum
pvecm status
# 2. Ceph health
ceph -s
# 3. HA status
ha-manager status
# 4. All VMs running
qm list && pct list
# 5. Critical services (hub1)
curl -s https://domains.charliehub.net/health
curl -s https://prometheus.charliehub.net/-/healthy
# 6. Backup cron jobs are still in place
ls /etc/cron.d/ceph-*
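The checklist above can be wrapped in a single pass/fail script; a minimal sketch that only uses commands and endpoints already listed in this runbook:
#!/bin/bash
# Post-recovery check sketch: prints PASS/FAIL per item, exits non-zero if anything failed
fail=0
check() { desc=$1; shift; if "$@" >/dev/null 2>&1; then echo "PASS  $desc"; else echo "FAIL  $desc"; fail=1; fi; }
check "cluster quorate"     sh -c 'pvecm status | grep -qi "quorate:.*yes"'
check "ceph HEALTH_OK"      sh -c 'ceph health | grep -q HEALTH_OK'
check "HA manager"          ha-manager status
check "VM list"             qm list
check "CT list"             pct list
check "domains health"      curl -fsS https://domains.charliehub.net/health
check "prometheus health"   curl -fsS https://prometheus.charliehub.net/-/healthy
check "backup cron present" sh -c 'ls /etc/cron.d/ceph-* >/dev/null'
exit $fail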