Failover Procedures¶
High availability and failover procedures for the CharlieHub cluster.
Cluster Architecture¶
The cluster uses Ceph for primary storage with automatic redundancy:
| Mechanism | Scope / Purpose | Recovery Time |
|---|---|---|
| Ceph Replication | All UK nodes (size=3) | Automatic (0 downtime) |
| Ceph Snapshots | Quick rollback | Seconds |
| Proxmox HA | VM auto-migration | 1-2 minutes |
| Vzdump Backups | Full VM recovery | 10-30 minutes |
| RBD Export (France) | Cross-site DR | 30-60 minutes |
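To confirm the replication settings behind the first row, the pool's size and min_size can be read straight from Ceph; a quick check, assuming the pool name ceph-pool used in the recovery commands later on this page:
# Expect size=3 (three UK copies) and min_size=2
ceph osd pool get ceph-pool size
ceph osd pool get ceph-pool min_size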
How Ceph HA Works¶
With size=3 replication, every write goes to ALL 3 UK nodes:
Write to VM1111
│
├──► osd.1 (px1-silverstone) ✓
├──► osd.2 (px2-monza) ✓
└──► osd.3 (px3-suzuka) ✓
Result: Any single node can fail with zero data loss.
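As a sanity check that an image's objects really map to three OSDs, the placement can be inspected; a minimal sketch using the vm-1111-disk-0 image referenced later on this page (the object name below is illustrative):
# Find the image's object prefix, then map one object to its acting set
rbd info ceph-pool/vm-1111-disk-0          # note block_name_prefix
ceph osd map ceph-pool rbd_data.<block_name_prefix>.0000000000000000
# The acting set should list three OSDs, one per UK node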
Node Failure Scenarios¶
Single UK Node Failure¶
Impact: None - Ceph continues on remaining 2 nodes, Proxmox HA migrates VMs.
What happens automatically:
1. Ceph detects OSD down, continues with 2 remaining copies
2. Proxmox HA migrates affected VMs to healthy nodes
3. Services resume within 1-2 minutes
Manual verification:
# Check Ceph status
ceph -s
# Check HA status
ha-manager status
# Check VM locations
qm list
px1 (Primary Node) Failure¶
Affects: VM1111, CT1112, CT1113 (if running there)
Automatic Recovery:
1. Proxmox HA detects node down
2. VMs migrate to px2 or px3
3. Ceph data already available on both nodes
Manual verification:
# From px2 or px3
pvecm status
ha-manager status
# VMs should show on new node
qm list
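If a VM has not relocated on its own a few minutes after px1 drops, HA can be nudged manually; a sketch, assuming the resource IDs from the HA-Managed VMs table at the end of this page and a target node named px2 (adjust if the cluster uses the full px2-monza name):
# Ask HA to move the resource to a healthy node
ha-manager migrate vm:1111 px2
# Or re-assert the desired state and let HA choose the placement
ha-manager set vm:1111 --state started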
Two UK Nodes Down (Quorum Lost)¶
Impact: Cluster loses quorum, no writes possible
Recovery:
# If 2 nodes are truly gone, force single-node operation
pvecm expected 1
# On remaining node, check Ceph
ceph -s
# Ceph needs at least min_size=2 copies available to serve I/O
# If only 1 OSD remains, set min_size=1 temporarily
ceph osd pool set ceph-pool min_size 1
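Once the failed nodes rejoin and quorum returns, revert the temporary relaxation so the normal guarantees described above (size=3, min_size=2) apply again:
# Restore the normal write guarantee and confirm health
ceph osd pool set ceph-pool min_size 2
pvecm status
ceph -s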
All UK Nodes Down (Site Failure)¶
Recovery from France (px5):
1. From RBD exports (fastest):
# On px5
cd /mnt/nvme-vmdata/dr-images/
# Decompress and convert
zstd -d < vm-1111-disk-0-20260103.raw.zst | \
  qemu-img convert -f raw -O qcow2 /dev/stdin /tmp/vm-1111.qcow2
# Create VM
qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
  --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
qm start 1111
2. From vzdump backups:
# List available backups
ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst
# Restore
qmrestore /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-latest.vma.zst 1111 \
  --storage local-lvm
qm start 1111
Service Failover¶
VM1111 (CharlieHub Services)¶
Affects: Traefik, Domain Manager, DDNS, Mail, SSH Bastion
Quick Recovery (Ceph snapshot):
# List available snapshots
rbd snap ls ceph-pool/vm-1111-disk-0
# Stop VM
qm stop 1111
# Rollback to snapshot
rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103
# Start VM
qm start 1111
Full Recovery (vzdump):
# Find latest backup
ls -la /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1
# Restore (destroys the current VM) - use the exact archive path from the ls output above
qmrestore /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst 1111 \
--storage ceph-pool --force
qm start 1111
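After either recovery path, confirm the VM is running and spot-check one of the services it fronts; the health URL below is the same one used in the post-failover checklist:
# VM should report status: running
qm status 1111
# Domain Manager behind Traefik on VM1111
curl -s https://domains.charliehub.net/health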
CT1112 (PostgreSQL)¶
Affects: All database-dependent services
Quick Recovery:
# Check database integrity first
pct exec 1112 -- pg_isready
pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"
# If corrupt, rollback
pct stop 1112
rbd snap rollback ceph-pool/vm-1112-disk-0@daily-20260103
pct start 1112
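After the rollback, repeat the same integrity checks to confirm the restored database accepts connections:
# PostgreSQL should answer on the rolled-back volume
pct exec 1112 -- pg_isready
pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"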
CT1113 (IoT Platform)¶
Affects: MQTT, Home Assistant, IoT integrations
Recovery:
# Same pattern as CT1112
pct stop 1113
rbd snap rollback ceph-pool/vm-1113-disk-0@daily-20260103
pct start 1113
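A quick post-restart check, assuming the IoT stack in CT1113 runs under Docker like the monitoring stack on CT1115 (an assumption, not confirmed on this page):
# MQTT and Home Assistant containers should come back up on their own
pct exec 1113 -- docker ps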
Monitoring¶
- Primary homelab monitoring: prod-monitoring (CT1115) on px1
- Cloud monitoring: hub2 (OVH Dedicated Server)
homelab-monitor (CT1115) Recovery¶
# From px1
pct exec 1115 -- bash -c "cd /opt/monitoring && docker compose restart"
# Verify
curl http://REDACTED_IP5:9090/-/healthy
curl http://REDACTED_IP5:3000/api/health
hub1 Monitoring Recovery¶
# On hub1
cd /opt/charliehub && docker compose restart prometheus grafana
# Verify
curl https://prometheus.charliehub.net/-/healthy
curl https://grafana.charliehub.net/api/health
Network Failover¶
Site-to-Site VPN Down¶
UK (10.44.x.x) and France (10.35.x.x) connected via UniFi SD-WAN.
Diagnosis:
# From px1
ping REDACTED_IP # px5
Fallback: Use WireGuard VPN via hub2
# From hub2, check WireGuard status
sudo wg show
# SSH via WireGuard
ssh px5 # px5 via WireGuard (REDACTED_IP)
Internet Outage (UK Site)¶
Impact: External access to UK services is lost
Notes:
- Internal services continue on 10.44.x.x
- hub2 can still reach the FR site via WireGuard
- Use hub2 as a jump host for emergency remote access (see the example below)
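A hedged example of the jump-host route mentioned above, assuming hub2 and px5 are defined as host aliases in ~/.ssh/config (adjust to the real WireGuard addresses):
# From an external machine, reach px5 through hub2 over WireGuard
ssh -J hub2 px5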
Post-Failover Checklist¶
# 1. Check cluster quorum
pvecm status
# 2. Check Ceph health
ceph -s
# 3. Verify HA status
ha-manager status
# 4. Check all VMs running
qm list
pct list
# 5. Check critical services (hub1)
curl -s https://domains.charliehub.net/health # Domain Manager
curl -s https://prometheus.charliehub.net/-/healthy # Prometheus
# 6. Verify backups will run
ls /etc/cron.d/ceph-*
HA-Managed VMs¶
These VMs auto-failover via Proxmox HA:
| VM/CT | Name | Storage | Auto-Failover |
|---|---|---|---|
| VM1111 | charliehub-hub | ceph-pool | ✅ Yes |
| CT1112 | prod-database | ceph-pool | ✅ Yes |
| CT1113 | prod-iot-platform | ceph-pool | ✅ Yes |
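To confirm these resources are actually registered with the HA stack, and to register one that is missing, the standard ha-manager commands apply; a sketch using the vm:/ct: resource IDs matching the table above:
# Should list vm:1111, ct:1112 and ct:1113
ha-manager config
# Register a missing resource (example: CT1113)
ha-manager add ct:1113 --state started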
Documentation References¶
- Backup Strategy - Backup locations and retention
- Recovery Runbooks - Step-by-step recovery
- Cluster Overview - Architecture details