Failover Procedures

High availability and failover procedures for the CharlieHub cluster.

Cluster Architecture

The cluster uses Ceph for primary storage with automatic redundancy:

Mechanism            Scope                  Recovery Time
Ceph Replication     All UK nodes (size=3)  Automatic (0 downtime)
Ceph Snapshots       Quick rollback         Seconds
Proxmox HA           VM auto-migration      1-2 minutes
Vzdump Backups       Full VM recovery       10-30 minutes
RBD Export (France)  Cross-site DR          30-60 minutes

How Ceph HA Works

With size=3 replication, every write goes to ALL 3 UK nodes:

Write to VM1111
      │
      ├──► osd.1 (px1-silverstone) ✓
      ├──► osd.2 (px2-monza) ✓
      └──► osd.3 (px3-suzuka) ✓

Result: Any single node can fail with zero data loss.
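
A quick check that the pool really is configured this way (assuming the pool name ceph-pool used elsewhere in this runbook):

# Replication factor and minimum copies required for I/O
ceph osd pool get ceph-pool size
ceph osd pool get ceph-pool min_size

# All three UK OSDs should be up and in
ceph osd tree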

Node Failure Scenarios

Single UK Node Failure

Impact: None - Ceph continues on remaining 2 nodes, Proxmox HA migrates VMs.

What happens automatically:

  1. Ceph detects OSD down, continues with 2 remaining copies
  2. Proxmox HA migrates affected VMs to healthy nodes
  3. Services resume within 1-2 minutes

Manual verification:

# Check Ceph status
ceph -s

# Check HA status
ha-manager status

# Check VM locations
qm list
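
To watch the automatic recovery happen rather than polling, follow the Ceph cluster log and the Proxmox HA services live (a sketch; pve-ha-crm and pve-ha-lrm are the standard Proxmox HA daemons):

# Follow the Ceph cluster log during recovery
ceph -w

# Follow HA manager activity on any surviving node
journalctl -f -u pve-ha-crm -u pve-ha-lrm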

px1 (Primary Node) Failure

Affects: VM1111, CT1112, CT1113 (if running there)

Automatic Recovery:

  1. Proxmox HA detects node down
  2. VMs migrate to px2 or px3
  3. Ceph data already available on both nodes

Manual verification:

# From px2 or px3
pvecm status
ha-manager status

# VMs should show on new node
qm list
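
Once px1 is healthy again, HA-managed guests can be moved back deliberately rather than left where they landed; a sketch assuming the node name px1-silverstone shown in the OSD diagram above:

# Migrate HA-managed guests back to the recovered node
ha-manager migrate vm:1111 px1-silverstone
ha-manager migrate ct:1112 px1-silverstone
ha-manager migrate ct:1113 px1-silverstone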

Two UK Nodes Failure (Quorum Lost)

Impact: Cluster loses quorum, no writes possible

Recovery:

# If 2 nodes are truly gone, force single-node operation
pvecm expected 1

# On remaining node, check Ceph
ceph -s

# The pool's min_size=2 blocks I/O when fewer than 2 copies are available
# If only 1 OSD host remains, lower min_size to 1 temporarily
ceph osd pool set ceph-pool min_size 1
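
Both commands above are emergency overrides. Once the failed nodes rejoin, revert them; a sketch assuming the normal 3-node UK quorum and min_size=2:

# Restore the pool's normal minimum copy requirement
ceph osd pool set ceph-pool min_size 2

# Restore expected quorum votes once all nodes are back
pvecm expected 3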

All UK Nodes Down (Site Failure)

Recovery from France (px5):

  1. From RBD exports (fastest):

    # On px5
    cd /mnt/nvme-vmdata/dr-images/
    
    # Decompress to a temporary raw image first
    # (qemu-img cannot read from a pipe), then convert to qcow2
    zstd -d vm-1111-disk-0-20260103.raw.zst -o /tmp/vm-1111.raw
    qemu-img convert -f raw -O qcow2 /tmp/vm-1111.raw /tmp/vm-1111.qcow2
    
    # Create VM
    qm create 1111 --name charliehub-dr --memory 8192 --cores 4 \
      --scsi0 local-lvm:0,import-from=/tmp/vm-1111.qcow2
    qm start 1111
    

  2. From vzdump backups:

    # List available backups
    ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst
    
    # Restore the most recent backup
    BACKUP=$(ls /mnt/pve/pikvm-backup/dump/vzdump-qemu-1111-*.vma.zst | tail -1)
    qmrestore "$BACKUP" 1111 --storage local-lvm
    qm start 1111
    

Service Failover

VM1111 (CharlieHub Services)

Affects: Traefik, Domain Manager, DDNS, Mail, SSH Bastion

Quick Recovery (Ceph snapshot):

# List available snapshots
rbd snap ls ceph-pool/vm-1111-disk-0

# Stop VM
qm stop 1111

# Rollback to snapshot
rbd snap rollback ceph-pool/vm-1111-disk-0@daily-20260103

# Start VM
qm start 1111
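
Note: if there is any doubt about the rollback target, snapshot the current state first so the rollback itself can be undone; a sketch with an illustrative snapshot name:

# Optional: preserve the current state before rolling back
rbd snap create ceph-pool/vm-1111-disk-0@pre-rollback-$(date +%Y%m%d)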

Full Recovery (vzdump):

# Find latest backup
BACKUP=$(ls /mnt/backup-storage/dump/vzdump-qemu-1111-*.vma.zst | tail -1)
echo "$BACKUP"

# Restore (destroys current VM)
qmrestore "$BACKUP" 1111 --storage ceph-pool --force
qm start 1111

CT1112 (PostgreSQL)

Affects: All database-dependent services

Quick Recovery:

# Check database integrity first
pct exec 1112 -- pg_isready
pct exec 1112 -- sudo -u postgres psql -c "SELECT 1"

# If corrupt, rollback
pct stop 1112
rbd snap rollback ceph-pool/vm-1112-disk-0@daily-20260103
pct start 1112

CT1113 (IoT Platform)

Affects: MQTT, Home Assistant, IoT integrations

Recovery:

# Same pattern as CT1112
pct stop 1113
rbd snap rollback ceph-pool/vm-1113-disk-0@daily-20260103
pct start 1113

Monitoring

Primary homelab monitoring: prod-monitoring (CT1115) on px1
Cloud monitoring: hub2 (OVH Dedicated Server)

homelab-monitor (CT1115) Recovery

# From px1
pct exec 1115 -- bash -c "cd /opt/monitoring && docker compose restart"

# Verify
curl http://REDACTED_IP5:9090/-/healthy
curl http://REDACTED_IP5:3000/api/health

hub1 Monitoring Recovery

# On hub1
cd /opt/charliehub && docker compose restart prometheus grafana

# Verify
curl https://prometheus.charliehub.net/-/healthy
curl https://grafana.charliehub.net/api/health

Network Failover

Site-to-Site VPN Down

UK (10.44.x.x) and France (10.35.x.x) are connected via UniFi SD-WAN.

Diagnosis:

# From px1
ping REDACTED_IP  # px5

Fallback: Use WireGuard VPN via hub2

# From hub2, check WireGuard status
sudo wg show

# SSH via WireGuard
ssh px5  # px5 via WireGuard (REDACTED_IP)
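
If wg show reports no recent handshake, restarting the tunnel on hub2 is a reasonable first step (the interface name wg0 is an assumption; adjust to the real tunnel name):

# Restart the WireGuard interface (name is an assumption)
sudo systemctl restart wg-quick@wg0
sudo wg show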

Internet Outage (UK Site)

Impact: External access to UK services is unavailable

Notes:

  - Internal services continue on 10.44.x.x
  - hub2 can still reach the FR site via WireGuard
  - Use hub2 as jump host for emergency remote access
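
A jump-host invocation for that last point might look like this (user, hostname, and target address are illustrative; substitute real values):

# Reach px5 through hub2 (all names/addresses illustrative)
ssh -J admin@hub2.charliehub.net root@<px5-wireguard-ip>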

Post-Failover Checklist

# 1. Check cluster quorum
pvecm status

# 2. Check Ceph health
ceph -s

# 3. Verify HA status
ha-manager status

# 4. Check all VMs running
qm list
pct list

# 5. Check critical services (hub1)
curl -s https://domains.charliehub.net/health     # Domain Manager
curl -s https://prometheus.charliehub.net/-/healthy  # Prometheus

# 6. Verify backups will run
ls /etc/cron.d/ceph-*

HA-Managed VMs

These VMs auto-failover via Proxmox HA:

VM/CT   Name               Storage    Auto-Failover
VM1111  charliehub-hub     ceph-pool  ✅ Yes
CT1112  prod-database      ceph-pool  ✅ Yes
CT1113  prod-iot-platform  ceph-pool  ✅ Yes
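
If one of these ever needs to be re-registered with HA (for example after a restore under the same VMID), a minimal sketch, assuming default HA groups and options:

# Re-register guests as HA resources (group/options are site-specific)
ha-manager add vm:1111 --state started
ha-manager add ct:1112 --state started
ha-manager add ct:1113 --state started
ha-manager status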

Documentation References