Backup & Recovery

Comprehensive backup schedules and recovery procedures for the CharlieHub cluster.

Last Verified: 2026-02-04


Backup Architecture

3-2-1 Strategy

  • 3 copies: Ceph (3x replication) + PBS (France) + vzdump (UK)
  • 2 media types: Local NAS (UK) + PBS/NFS (France)
  • 1 off-site: France site (PBS on pikvm-backup)

Backup Methods

| Method | Storage | Purpose | Transfer Size |
|---|---|---|---|
| PBS (Primary) | pbs-fr (France) | Incremental off-site | 1-5 GB/night |
| Vzdump (UK) | px3-nas | Fast UK restore | Full backup |
| Ceph Replication | ceph-pool | Live redundancy | Automatic |

PBS vs Vzdump

PBS performs incremental backups with chunk-level deduplication: only changed data crosses the WAN. After the initial full backup, nightly transfers drop from 40-100 GB to 1-5 GB.
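The deduplication factor is shown on the datastore summary in the PBS web UI; from a PVE node you can also query datastore usage over the API with `proxmox-backup-client` (the repository string below assumes `root@pam` and the datastore from this page):

```shell
# Show usage of the pbs-main datastore (repository format: user@realm@host:datastore)
proxmox-backup-client status --repository root@pam@10.35.1.101:pbs-main
```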

Backup Layers

| Layer | Method | Location | Recovery Time | Purpose |
|---|---|---|---|---|
| Ceph Replication | Automatic | 3x UK nodes | Instant | Live redundancy |
| PBS Incremental | Daily 22:00-03:00 | pbs-fr (France) | 10-20 min | Primary off-site |
| UK Secondary | Daily 05:30 | px3-nas (NFS) | 5-10 min | Fast UK restore |
| PBS Weekly | Sunday 07:00 | pbs-fr | 10-20 min | Long-term archive |

Daily Schedule

All times in UTC. Node-staggered to prevent I/O contention.

PBS Jobs (Primary - Incremental)

| Time | Job | Node | Storage | Retention |
|---|---|---|---|---|
| 22:00 | pbs-px3-daily | px3 | pbs-fr | 7 daily, 4 weekly, 2 monthly |
| 00:30 | pbs-px2-daily | px2 | pbs-fr | 7 daily, 4 weekly, 2 monthly |
| 03:00 | pbs-px1-daily | px1 | pbs-fr | 7 daily, 4 weekly, 2 monthly |

Vzdump Jobs (UK Local)

| Time | Job | Node | Storage | Retention |
|---|---|---|---|---|
| 05:30 | uk-secondary | px2, px3 | px3-nas | 5 daily, 2 weekly |

Weekly Schedule

| Day | Time | Job | Storage | Retention |
|---|---|---|---|---|
| Sunday | 07:00 | pbs-weekly | pbs-fr | 8 weekly, 3 monthly |

Schedule Rationale

The backup schedule is node-staggered to prevent Ceph I/O contention:

22:00-00:00  px3-suzuka window (lightest workload, starts first)
00:30-02:30  px2-monza window
03:00-05:00  px1-silverstone window (production, last)
05:30-06:00  UK secondary backup (after PBS completes)
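The staggered windows above are encoded in the cluster-wide backup job definitions. On any PVE node (7.2 or later) these live in `/etc/pve/jobs.cfg`, so the configured schedules can be checked against this table:

```shell
# Show all configured backup jobs and their schedules (cluster-wide file)
cat /etc/pve/jobs.cfg

# Or filter to just the job IDs and schedule lines
grep -E '^vzdump|schedule' /etc/pve/jobs.cfg
```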

Proxmox Backup Server (PBS)

PBS runs on CT 5101 (pbs-fr) in France, providing incremental backups with deduplication.

PBS Details

| Parameter | Value |
|---|---|
| Container | CT 5101 on px5-lemans |
| IP | 10.35.1.101 |
| Web UI | https://10.35.1.101:8007 |
| Datastore | pbs-main (on pikvm-backup NFS) |
| PVE Storage | pbs-fr |

PBS Benefits

| Metric | Before (vzdump NFS) | After (PBS) |
|---|---|---|
| Nightly WAN transfer | 40-100 GB | 1-5 GB |
| Storage used | ~1.5 TB | ~500 GB |
| Restore time (France) | 30-60 min | 10-20 min |
| Resume on failure | No | Yes |

For full PBS documentation, see PBS Service.


Ceph Scrub Window

Ceph scrubs are restricted to 09:00-17:00 UTC (business hours) to avoid overlap with overnight backups.

# Current scrub settings
osd_scrub_begin_hour = 9
osd_scrub_end_hour = 17
osd_max_scrubs = 1
osd_scrub_load_threshold = 0.3
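These values can be set and verified at runtime through the Ceph config database (persisted by the monitors, no OSD restart required):

```shell
# Apply the scrub window settings cluster-wide
ceph config set osd osd_scrub_begin_hour 9
ceph config set osd osd_scrub_end_hour 17
ceph config set osd osd_max_scrubs 1
ceph config set osd osd_scrub_load_threshold 0.3

# Verify what the cluster currently has
ceph config get osd osd_scrub_begin_hour
ceph config get osd osd_scrub_end_hour
```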

Vzdump Settings

Global vzdump settings in /etc/vzdump.conf (all nodes):

bwlimit: 80000     # 80 MB/s max bandwidth
ionice: 7          # Lowest I/O priority
pigz: 2            # 2 compression threads
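Any manual `vzdump` run on a node inherits these global settings, so a one-off backup can be taken without worrying about saturating the NAS link. A sketch, using CT 1112 and the storages from this page:

```shell
# One-off manual backup of CT 1112 to the UK NAS
# (bwlimit/ionice/pigz are inherited from /etc/vzdump.conf)
vzdump 1112 --storage px3-nas --mode snapshot --compress zstd

# Same, but to PBS for an off-site copy
vzdump 1112 --storage pbs-fr --mode snapshot
```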

Storage Locations

| Storage | Type | Location | Capacity | Purpose |
|---|---|---|---|---|
| pbs-fr | PBS | CT 5101 (France) | ~1.1 TB free | Primary off-site |
| px3-nas | NFS | 10.44.1.30 (UK) | 1.8 TB | Fast UK restore |
| ceph-pool | RBD | 5 OSDs UK | 8.9 TB | Live VM storage |

Recovery Procedures

Restore from PBS (France)

# List available PBS backups
pvesm list pbs-fr --content backup | grep 1112

# Restore CT from PBS
pct restore 1112 pbs-fr:backup/ct/1112/2026-02-04T03:00:00Z --storage ceph-pool

# Restore VM from PBS
qmrestore pbs-fr:backup/vm/1111/2026-02-04T03:00:00Z 1111 --storage ceph-pool

# Restore to different VMID (test)
pct restore 9999 pbs-fr:backup/ct/1112/2026-02-04T03:00:00Z --storage ceph-pool --unique

Recovery time: 10-20 minutes

Restore from UK Secondary (Fastest)

# List available backups
ls -lh /mnt/pve/px3-nas/dump/ | grep 2912

# Restore
pct restore 2912 /mnt/pve/px3-nas/dump/vzdump-lxc-2912-*.tar.zst --storage ceph-pool

Recovery time: 5-10 minutes

File-Level Restore (PBS Only)

PBS supports individual file restore without full VM recovery:

  1. Open PBS Web UI (https://10.35.1.101:8007)
  2. Navigate to Datastore > pbs-main > Content
  3. Select backup snapshot
  4. Click "Browse Files"
  5. Download individual files
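The same can be done from the CLI with `proxmox-backup-client`, restoring a single archive from a snapshot without touching the running CT. A sketch, using the CT 1112 snapshot shown elsewhere on this page (repository credentials assumed to be `root@pam`; the target path `/tmp/restore` is arbitrary):

```shell
# List snapshots for CT 1112 on the France datastore
proxmox-backup-client snapshot list \
    --repository root@pam@10.35.1.101:pbs-main

# Restore the container's root filesystem archive to a scratch directory,
# then copy out the individual files you need
proxmox-backup-client restore "ct/1112/2026-02-04T03:00:00Z" root.pxar /tmp/restore \
    --repository root@pam@10.35.1.101:pbs-main
```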

Monitoring & Troubleshooting

Check Backup Status

# Check PBS storage
pvesm status | grep pbs-fr

# List PBS backups
pvesm list pbs-fr --content backup | tail -10

# Check vzdump logs
tail -100 /var/log/vzdump/vzdump-*.log

Common Issues

PBS unreachable:

# Check PBS container
ssh px5 "pct status 5101"

# Check PBS service
ssh px5 "pct exec 5101 -- systemctl status proxmox-backup-proxy"

# UK backups (px3-nas) continue regardless
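If the container is running but the proxy is not answering, restarting the PBS daemons inside CT 5101 is usually enough (unit names are the stock PBS services):

```shell
# Restart the PBS API and proxy daemons inside the container
ssh px5 "pct exec 5101 -- systemctl restart proxmox-backup proxmox-backup-proxy"

# Confirm the web UI port is listening again
ssh px5 "pct exec 5101 -- ss -tlnp | grep 8007"
```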

Ceph slow ops during backups:

# Check current slow ops
ceph health detail | grep slow

# Emergency: pause scrubs
ceph osd set noscrub
ceph osd set nodeep-scrub
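The `noscrub`/`nodeep-scrub` flags persist until cleared and will leave HEALTH_WARN set, so remember to re-enable scrubbing once the backup window has passed:

```shell
# Re-enable scrubs after the I/O pressure subsides
ceph osd unset noscrub
ceph osd unset nodeep-scrub

# Confirm the flags are gone
ceph health detail
```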


Critical VMs

These VMs have all protection layers:

| VMID | Name | PBS (France) | UK Secondary |
|---|---|---|---|
| 1112 | prod-database | ✓ | ✓ |
| 1113 | prod-iot-platform | ✓ | ✓ |
| 1118 | isp-monitor (STOPPED - migrated to Mint) | ✓ | ✓ |
| 3102 | homelab-monitor | ✓ | ✓ |

Retention Summary

| Storage | Daily | Weekly | Monthly |
|---|---|---|---|
| pbs-fr (PBS) | 7 | 4 | 2 |
| px3-nas | 5 | 2 | - |
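Retention is enforced per storage via `prune-backups` in `/etc/pve/storage.cfg`. A sketch of what the two entries would look like for the retention above (other fields such as `fingerprint` and `username` omitted; the `server`/`datastore` values are taken from this page):

```shell
# Excerpt of /etc/pve/storage.cfg (illustrative)
# pbs: pbs-fr
#     server 10.35.1.101
#     datastore pbs-main
#     prune-backups keep-daily=7,keep-weekly=4,keep-monthly=2
#
# nfs: px3-nas
#     server 10.44.1.30
#     prune-backups keep-daily=5,keep-weekly=2

# Verify what is actually configured:
grep -A1 prune-backups /etc/pve/storage.cfg
```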

Legacy Jobs (Disabled)

These vzdump-to-NFS jobs have been superseded by PBS and are currently running in parallel for validation:

| Job | Replaced By | Status |
|---|---|---|
| pikvm-px1 | pbs-px1-daily | Parallel run |
| pikvm-px2 | pbs-px2-daily | Parallel run |
| pikvm-px3 | pbs-px3-daily | Parallel run |
| backup-55292acc | pbs-px1-daily | Parallel run |
| weekly-archive | pbs-weekly | Parallel run |

Parallel Run Period

During Feb 4-18, 2026, both old vzdump and new PBS jobs run simultaneously for validation. After successful validation, old vzdump jobs will be disabled.


Hub2 Backups

hub2 (central dedicated server) has a separate backup system using rsync over WireGuard VPN.

| Time | Target | Path | Retention |
|---|---|---|---|
| 03:00 UTC | px3 (UK) | /mnt/nas-backup/hub2-snapshots/ | 7 daily |
| 03:00 UTC | px5 (FR) | /mnt/pve/pikvm-backup/hub2-offsite/ | 7 daily |


Last updated: 2026-02-04