
Recovery Runbooks

Step-by-step procedures for disaster recovery scenarios.

Quick Recovery Commands

Fast Local Rollback (Ceph Snapshots - Seconds)

# List available snapshots for VM1111
rbd snap ls ceph-pool/vm-1111-disk-0

# Stop VM and rollback
qm stop 1111
rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103
qm start 1111
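
If there is any doubt about losing the current state, a safety snapshot can be taken between qm stop and the rollback so the rollback itself stays reversible (plain rbd usage; the snapshot name below is just an example):

rbd snap create ceph-pool/vm-1111-disk-0@pre-rollback-20260103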

Restore from Vzdump (Minutes)

# Find the latest backup
BACKUP=$(ls /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1)

# Restore (overwrites the existing VM 1111)
qmrestore "$BACKUP" 1111 --storage ceph-pool --force

Restore from France RBD Export (30-60 min)

# On px5 - decompress and import
zstd -d < /mnt/nvme-vmdata/dr-images/vm-1111-disk-0-*.raw.zst | \
  qemu-img convert -f raw -O qcow2 /dev/stdin /tmp/vm-1111.qcow2

qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
  --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
qm start 1111
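
If the imported VM misbehaves, it can be worth sanity-checking the converted image first (standard qemu-img commands):

qemu-img check /tmp/vm-1111.qcow2
qemu-img info /tmp/vm-1111.qcow2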

Scenario: VM1111 (CharlieHub) Down

Impact: External access, domain routing, SSH bastion, authentication

Quick Fix (< 2 minutes)

  1. Check if VM is running:

    qm status 1111
    ha-manager status
    

  2. If stopped, HA should restart it automatically. To start manually:

    qm start 1111
    

  3. If it is on the wrong node, HA will migrate it automatically. To force if needed:

    ha-manager migrate vm:1111 px1-silverstone
    

Rollback to Snapshot (< 5 minutes)

  1. Stop VM:

    qm stop 1111
    

  2. List and rollback:

    rbd snap ls ceph-pool/vm-1111-disk-0
    rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103
    

  3. Start:

    qm start 1111
    

Full Restore from Backup (15-30 minutes)

  1. Remove corrupted VM:

    qm stop 1111
    qm destroy 1111 --purge
    

  2. Restore from vzdump:

    qmrestore "$(ls /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1)" 1111 \
      --storage ceph-pool
    

  3. Re-add to HA and start:

    ha-manager add vm:1111 --state started
    

  4. Verify services:

    ssh ubuntu@151.80.58.99 "docker ps"
    curl -s https://docs.charliehub.net
    


Scenario: CT1112 (PostgreSQL) Down

Impact: All database-dependent services

Quick Fix

  1. Check status:

    pct status 1112
    ha-manager status
    

  2. If stopped:

    pct start 1112
    

  3. Verify database:

    pct exec 1112 -- pg_isready
    pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"
    

Rollback to Snapshot

pct stop 1112
rbd snap ls ceph-pool/vm-1112-disk-0
rbd snap rollback ceph-pool/vm-1112-disk-0@daily-20260103
pct start 1112

Full Restore

pct stop 1112
pct destroy 1112 --purge
pct restore 1112 "$(ls /mnt/backup-storage/dump/vzdump-lxc-1112-*.tar.zst | tail -1)" \
  --storage ceph-pool
ha-manager add ct:1112 --state started

Scenario: CT1113 (IoT Platform) Down

Impact: MQTT, Home Assistant, IoT integrations

Quick Fix

pct status 1113
pct start 1113  # If stopped

Rollback

pct stop 1113
rbd snap rollback ceph-pool/vm-1113-disk-0@daily-20260103
pct start 1113

Verify Services

pct exec 1113 -- docker ps
pct exec 1113 -- mosquitto_pub -t test -m "ping"
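
The publish above only confirms the broker accepts connections. For a full round trip, one option is the sketch below, assuming the mosquitto clients inside CT1113 and an anonymous-friendly broker (as the existing check implies); the healthcheck topic name is arbitrary:

pct exec 1113 -- sh -c 'mosquitto_sub -t healthcheck -C 1 -W 10 & sleep 1; mosquitto_pub -t healthcheck -m ping; wait'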

Scenario: Single Node Failure

Impact: Minimal - Ceph and HA handle the failure automatically

What Happens Automatically

  1. Ceph detects OSD down, continues with 2 remaining copies
  2. Proxmox HA migrates VMs to healthy nodes
  3. Services resume in 1-2 minutes
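
To watch these automatic actions as they happen, follow the HA services live (standard Proxmox VE unit names assumed):

# CRM makes cluster-wide decisions; LRM executes them on each node
journalctl -fu pve-ha-crm
journalctl -fu pve-ha-lrm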

Verification

# Check cluster
pvecm status

# Check Ceph (will show degraded until node returns)
ceph -s

# Check HA migrated VMs
ha-manager status
qm list

When Node Returns

# Ceph automatically rebalances
ceph -s  # Watch recovery progress

# VMs stay on current nodes (no auto-migration back)
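
If services should move back to their usual node once it is healthy, request the migration explicitly (same command used earlier; the node name is this runbook's example):

ha-manager migrate vm:1111 px1-silverstone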

Scenario: Two Nodes Down (Quorum Lost)

Impact: Cluster read-only, no VM operations

Force Single-Node Operation

# On remaining node
pvecm expected 1

# Check Ceph (may need min_size adjustment)
ceph -s

# If only one replica remains available, temporarily allow I/O with a single copy
ceph osd pool set ceph-pool min_size 1
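
If the down nodes are expected back soon, it can also help to stop Ceph from marking their OSDs out and rebalancing in the meantime (standard flag, optional; unset it once they return):

ceph osd set noout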

Restore Quorum When Nodes Return

# Nodes rejoin automatically
pvecm status

# Reset expected votes
pvecm expected 3  # or 4 for full cluster

# Reset min_size
ceph osd pool set ceph-pool min_size 2
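
If the noout flag was set during the outage, clear it as well:

ceph osd unset noout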

Scenario: Complete UK Site Failure

Impact: All UK services down, DR activation required

Activate DR on px5 (France)

  1. Assess situation:

    # From px5
    ping REDACTED_IP  # Should fail if UK is down
    
    # From hub2, check WireGuard connectivity
    sudo wg show
    

  2. Restore from RBD exports (fastest):

    cd /mnt/nvme-vmdata/dr-images/
    
    # VM1111
    zstd -d < vm-1111-disk-0-*.raw.zst | \
      qemu-img convert -f raw -O qcow2 /dev/stdin /tmp/vm-1111.qcow2
    qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
      --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
    qm start 1111
    

  3. Or restore from vzdump:

    BACKUP=$(ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst | tail -1)
    qmrestore "$BACKUP" 1111 --storage local-lvm
    qm start 1111
    

  4. Update DNS to point to the France site (example sketch below)
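
    DNS hosting for charliehub.net is not documented here, so the commands below
    are only a sketch assuming the zone is managed in Cloudflare; ZONE_ID,
    RECORD_ID and CF_API_TOKEN are placeholders, and docs.charliehub.net is just
    one example record to repoint.

    # Point the record at the px5 public IP (Cloudflare v4 API, assumed provider)
    curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
      -H "Authorization: Bearer $CF_API_TOKEN" \
      -H "Content-Type: application/json" \
      --data '{"type":"A","name":"docs.charliehub.net","content":"<px5-public-ip>","ttl":300}'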


Scenario: Ceph Cluster Degraded

Check Health

ceph health detail
ceph osd tree
ceph -s

Common Fixes

OSD Down:

# Restart OSD
ssh root@<node> "systemctl restart ceph-osd@<id>"

# Check status
ceph osd tree
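
If the OSD keeps dropping, the crash module (enabled by default on recent Ceph releases) may show why:

ceph crash ls
ceph crash info <crash_id>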

Slow Recovery:

# Check recovery progress
ceph -s

# Speed up recovery (temporarily)
ceph tell 'osd.*' injectargs '--osd-max-backfills=4'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active=6'
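
Once recovery finishes, return these to the cluster's normal values (typical defaults shown; adjust if your baseline differs):

ceph tell 'osd.*' injectargs '--osd-max-backfills=1'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active=3'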

PGs Stuck:

# Find stuck PGs
ceph pg dump_stuck

# Repair an inconsistent PG (scrub errors)
ceph pg repair <pg_id>
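
If a PG is stuck recovering rather than inconsistent, Luminous and later can raise its recovery priority:

ceph pg force-recovery <pg_id>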


Post-Recovery Checklist

# 1. Cluster quorum
pvecm status

# 2. Ceph health
ceph -s

# 3. HA status
ha-manager status

# 4. All VMs running
qm list && pct list

# 5. Critical services (hub1)
curl -s https://domains.charliehub.net/health
curl -s https://prometheus.charliehub.net/-/healthy

# 6. Backup cron jobs still in place
ls /etc/cron.d/ceph-*
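
# 7. Fresh backup archives present (paths as used earlier in this runbook)
ls -lt /mnt/backup-storage/dump/ | head
ls -lt /mnt/nvme-vmdata/dr-images/ | head   # on px5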