Failover Procedures

High availability and failover procedures for the CharlieHub cluster.

Cluster Architecture

The cluster uses Ceph for primary storage with automatic redundancy:

Mechanism            Scope                  Recovery Time
Ceph Replication     All UK nodes (size=3)  Automatic (0 downtime)
Ceph Snapshots       Quick rollback         Seconds
Proxmox HA           VM auto-migration      1-2 minutes
Vzdump Backups       Full VM recovery       10-30 minutes
RBD Export (France)  Cross-site DR          30-60 minutes

How Ceph HA Works

With size=3 replication, every write goes to ALL 3 UK nodes:

Write to VM1111
      │
      ├──► osd.1 (px1-silverstone) ✓
      ├──► osd.2 (px2-monza) ✓
      └──► osd.3 (px3-suzuka) ✓

Result: Any single node can fail with zero data loss.
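
A quick check that the pool really is configured this way (assuming the pool name ceph-pool used elsewhere in this runbook):

# Replication factor and minimum copies required for I/O
ceph osd pool get ceph-pool size
ceph osd pool get ceph-pool min_size

# All three UK OSDs should be up and in
ceph osd tree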

Node Failure Scenarios

Single UK Node Failure

Impact: None - Ceph continues on remaining 2 nodes, Proxmox HA migrates VMs.

What happens automatically:

  1. Ceph detects OSD down, continues with 2 remaining copies
  2. Proxmox HA migrates affected VMs to healthy nodes
  3. Services resume within 1-2 minutes

Manual verification:

# Check Ceph status
ceph -s

# Check HA status
ha-manager status

# Check VM locations
qm list
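
To watch the automatic recovery happen rather than polling, follow the Ceph cluster log and the Proxmox HA services live (a sketch; pve-ha-crm and pve-ha-lrm are the standard Proxmox HA daemons):

# Follow the Ceph cluster log during recovery
ceph -w

# Follow HA manager activity on any surviving node
journalctl -f -u pve-ha-crm -u pve-ha-lrm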

px1 (Primary Node) Failure

Affects: VM1111, CT1112, CT1113 (if running there)

Automatic Recovery:

  1. Proxmox HA detects node down
  2. VMs migrate to px2 or px3
  3. Ceph data already available on both nodes

Manual verification:

# From px2 or px3
pvecm status
ha-manager status

# VMs should show on new node
qm list
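
Once px1 is healthy again, HA-managed guests can be moved back deliberately rather than left where they landed; a sketch assuming the node name px1-silverstone shown in the OSD diagram above:

# Migrate HA-managed guests back to the recovered node
ha-manager migrate vm:1111 px1-silverstone
ha-manager migrate ct:1112 px1-silverstone
ha-manager migrate ct:1113 px1-silverstone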

Two UK Nodes Failure (Quorum Lost)

Impact: Cluster loses quorum, no writes possible

Recovery:

# If 2 nodes are truly gone, force single-node operation
pvecm expected 1

# On remaining node, check Ceph
ceph -s

# The pool's min_size=2 blocks I/O when fewer than 2 copies are available
# If only 1 OSD host remains, lower min_size to 1 temporarily
ceph osd pool set ceph-pool min_size 1
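
Both commands above are emergency overrides. Once the failed nodes rejoin, revert them; a sketch assuming the normal 3-node UK quorum and min_size=2:

# Restore the pool's normal minimum copy requirement
ceph osd pool set ceph-pool min_size 2

# Restore expected quorum votes once all nodes are back
pvecm expected 3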

All UK Nodes Down (Site Failure)

Recovery from France (px5):

  1. From RBD exports (fastest):

    # On px5
    cd /mnt/nvme-vmdata/dr-images/
    
    # Decompress to a temporary raw image first
    # (qemu-img cannot read from a pipe), then convert to qcow2
    zstd -d vm-1111-disk-0-20260103.raw.zst -o /tmp/vm-1111.raw
    qemu-img convert -f raw -O qcow2 /tmp/vm-1111.raw /tmp/vm-1111.qcow2
    
    # Create VM
    qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
      --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
    qm start 1111
    

  2. From vzdump backups:

    # List available backups
    ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst
    
    # Restore the most recent backup
    BACKUP=$(ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst | tail -1)
    qmrestore "$BACKUP" 1111 --storage local-lvm
    qm start 1111
    

Service Failover

VM1111 (CharlieHub Services)

Affects: Traefik, Domain Manager, DDNS, Mail, SSH Bastion

Quick Recovery (Ceph snapshot):

# List available snapshots
rbd snap ls ceph-pool/vm-1111-disk-0

# Stop VM
qm stop 1111

# Rollback to snapshot
rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103

# Start VM
qm start 1111
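
Note: if there is any doubt about the rollback target, snapshot the current state first so the rollback itself can be undone; a sketch with an illustrative snapshot name:

# Optional: preserve the current state before rolling back
rbd snap create ceph-pool/vm-1111-disk-0@pre-rollback-$(date +%Y%m%d)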

Full Recovery (vzdump):

# Find latest backup
BACKUP=$(ls /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1)
echo "$BACKUP"

# Restore (destroys current VM)
qmrestore "$BACKUP" 1111 --storage ceph-pool --force
qm start 1111

CT1112 (PostgreSQL)

Affects: All database-dependent services

Quick Recovery:

# Check database integrity first
pct exec 1112 -- pg_isready
pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"

# If corrupt, rollback
pct stop 1112
rbd snap rollback ceph-pool/vm-1112-disk-0@daily-20260103
pct start 1112

CT1113 (IoT Platform)

Affects: MQTT, Home Assistant, IoT integrations

Recovery:

# Same pattern as CT1112
pct stop 1113
rbd snap rollback ceph-pool/vm-1113-disk-0@daily-20260103
pct start 1113

Monitoring

Primary homelab monitoring: prod-monitoring (CT1115) on px1
Cloud monitoring: hub2 (OVH Dedicated Server)

homelab-monitor (CT1115) Recovery

# From px1
pct exec 1115 -- bash -c "cd /opt/monitoring && docker compose restart"

# Verify
curl http://REDACTED_IP5:9090/-/healthy
curl http://REDACTED_IP5:3000/api/health

hub1 Monitoring Recovery

# On hub1
cd /opt/charliehub && docker compose restart prometheus grafana

# Verify
curl https://prometheus.charliehub.net/-/healthy
curl https://grafana.charliehub.net/api/health

Network Failover

Site-to-Site VPN Down

UK (10.44.x.x) and France (10.35.x.x) are connected via UniFi SD-WAN.

Diagnosis:

# From px1
ping REDACTED_IP  # px5

Fallback: Use WireGuard VPN via hub2

# From hub2, check WireGuard status
sudo wg show

# SSH via WireGuard
ssh px5  # px5 via WireGuard (REDACTED_IP)
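
If wg show reports no recent handshake, restarting the tunnel on hub2 is a reasonable first step (the interface name wg0 is an assumption; adjust to the real tunnel name):

# Restart the WireGuard interface (name is an assumption)
sudo systemctl restart wg-quick@wg0
sudo wg show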

Internet Outage (UK Site)

Impact: External access to UK services is unavailable

Notes:

  - Internal services continue on 10.44.x.x
  - hub2 can still reach the FR site via WireGuard
  - Use hub2 as jump host for emergency remote access
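
A jump-host invocation for that last point might look like this (user, hostname, and target address are illustrative; substitute real values):

# Reach px5 through hub2 (all names/addresses illustrative)
ssh -J admin@hub2.charliehub.net root@<px5-wireguard-ip>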

Post-Failover Checklist

# 1. Check cluster quorum
pvecm status

# 2. Check Ceph health
ceph -s

# 3. Verify HA status
ha-manager status

# 4. Check all VMs running
qm list
pct list

# 5. Check critical services (hub1)
curl -s https://domains.charliehub.net/health     # Domain Manager
curl -s https://prometheus.charliehub.net/-/healthy  # Prometheus

# 6. Verify backups will run
ls /etc/cron.d/ceph-*

HA-Managed VMs

These VMs auto-failover via Proxmox HA:

VM/CT   Name               Storage    Auto-Failover
VM1111  charliehub-hub     ceph-pool  ✅ Yes
CT1112  prod-database      ceph-pool  ✅ Yes
CT1113  prod-iot-platform  ceph-pool  ✅ Yes
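
If one of these ever needs to be re-registered with HA (for example after a restore under the same VMID), a minimal sketch, assuming default HA groups and options:

# Re-register guests as HA resources (group/options are site-specific)
ha-manager add vm:1111 --state started
ha-manager add ct:1112 --state started
ha-manager add ct:1113 --state started
ha-manager status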

Documentation References