# Ceph Storage Guide
This guide documents the Ceph cluster configuration, operational procedures, and hardening measures.
## Current Status

**Fully Operational: 5 OSDs, 3x Replication**

All OSDs active, pool at size=3, HA configured, live migration working.

| Component | Status |
|---|---|
| OSD.0 (px3, Samsung 870 EVO 2TB) | ✅ Active |
| OSD.1 (px2, Samsung 870 EVO 2TB) | ✅ Active |
| OSD.2 (px1, Samsung 870 EVO 2TB) | ✅ Active |
| OSD.3 (px1, NVMe ~1.7TB) | ✅ Active |
| OSD.4 (px2, NVMe ~1.8TB) | ✅ Active |
| Pool replication | 3x (size=3, min_size=2) |
| Monitors | 3 (px1, px2, px3) |
| Managers | px2 (active), px1 (standby) |
| All VMs on Ceph | ✅ Complete |
| HA configured | ✅ Complete |
| Live migration | ✅ Tested working |
| Monitoring | ✅ Grafana + Prometheus |
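
The snapshot above can be cross-checked from any cluster node. A minimal sketch using standard Ceph status commands (not specific to this cluster):

```bash
# Overall cluster state: health, monitor quorum, manager roles, OSD count
ceph -s

# Monitor quorum and manager roles, one line each
ceph mon stat
ceph mgr stat

# All five OSDs should report up and in
ceph osd stat
```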
## Architecture

```text
px1-silverstone (i3-12100T, 8 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── OSD.3 also on NVMe partition (~1.7TB)
├── sda: Ceph OSD.2 (Samsung 870 EVO 2TB, SATA)
└── sdb: backup-storage (USB)

px2-monza (i3-8100T, 4 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── OSD.4 also on NVMe partition (~1.8TB)
└── sda: Ceph OSD.1 (Samsung 870 EVO 2TB, SATA)

px3-suzuka (i3-7100, 4 cores, 32GB)
├── sda: OS/LVM (120GB SATA SSD)
├── sdb: REMOVED (was WD Green HDD, SSD replacement on order)
├── sdc: Ceph OSD.0 (Samsung 870 EVO 2TB, SATA, hardened)
└── sdd: NAS storage (USB, temporary primary)

Ceph Cluster:
├── Monitors: px1, px2, px3 (quorum of 3)
├── Managers: px2 (active), px1 (standby)
├── OSDs: 5 active (2 on px1, 2 on px2, 1 on px3)
├── Pool: ceph-pool (size=3, min_size=2)
├── CRUSH rule: replicated_rule (chooseleaf_firstn, type host)
├── Raw capacity: 8.9 TiB
├── Used: ~1.5 TiB (17%)
└── Available: ~2.0 TiB usable (after 3x replication)
```
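
The pool and CRUSH settings listed above can be confirmed directly; a sketch using the pool name `ceph-pool` from the tree:

```bash
# Replication settings on the pool (expect size=3, min_size=2)
ceph osd pool get ceph-pool size
ceph osd pool get ceph-pool min_size

# CRUSH rule assigned to the pool and its host-level placement step
ceph osd pool get ceph-pool crush_rule
ceph osd crush rule dump replicated_rule

# Raw vs. usable capacity
ceph df
```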
## OSD Distribution
| OSD | Node | Device | Type | Weight | PGs |
|---|---|---|---|---|---|
| osd.0 | px3-suzuka | sdc | Samsung 870 EVO 2TB (SATA) | 1.81940 | 65 |
| osd.1 | px2-monza | sda | Samsung 870 EVO 2TB (SATA) | 1.70509 | 31 |
| osd.2 | px1-silverstone | sda | Samsung 870 EVO 2TB (SATA) | 1.81940 | 34 |
| osd.3 | px1-silverstone | nvme | NVMe ~1.7TB | 1.70509 | 31 |
| osd.4 | px2-monza | nvme | NVMe ~1.8TB | 1.81940 | 34 |

**OSD.0 Balance:** osd.0 on px3 carries all 65 PGs because it is the only OSD on its host. CRUSH places one replica per host, so px3 holds a full copy of all data; px1 and px2 each split their replicas across two OSDs.
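
The imbalance is easy to see in the per-OSD PG counts; a quick sketch:

```bash
# PGs and utilization per OSD, grouped by host (osd.0 should show ~65 PGs)
ceph osd df tree

# List the placement groups that keep a replica on osd.0
ceph pg ls-by-osd 0
```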
## OSD.0 Hardening (px3)

Applied 2026-02-07 after a suicide timeout incident caused by I/O contention.

### Root Cause
The WD20EADS Caviar Green HDD (SATA 1.5 Gb/s, ~15 years old) on px3 shared the SATA controller with the Ceph OSD SSD. Backup jobs (especially nas-mirror-sync.sh at 03:00) saturated the controller, starving osd.0 of I/O. The OSD hit a suicide timeout and self-terminated. systemd's default restart policy (3 attempts in 30 min) was exhausted, leaving osd.0 down for ~10.5 hours.
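
When diagnosing this kind of contention, the negotiated SATA link speed and per-device utilization are the first things to check. A sketch (`iostat` assumes the sysstat package is installed):

```bash
# Negotiated link speed per SATA port (the WD Green was stuck at 1.5 Gb/s)
dmesg | grep -i "SATA link up"

# Per-device utilization and wait times during a backup window
iostat -x 5

# The same pressure from osd.0's point of view
journalctl -u ceph-osd@0 | grep -i -E "slow|timeout|heartbeat"
```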
### Fixes Applied

1. **systemd restart policy** (`/etc/systemd/system/ceph-osd@.service.d/restart-policy.conf`):

   ```ini
   [Service]
   StartLimitIntervalSec=0
   StartLimitBurst=0
   RestartSec=20
   ```

2. **Recovery throttling** (per-OSD, osd.0 only):

   ```bash
   ceph config set osd.0 osd_recovery_max_active 1
   ceph config set osd.0 osd_max_backfills 1
   ceph config set osd.0 osd_recovery_sleep_ssd 0.1
   ```

3. **Backup I/O priority**: all NAS cron jobs wrapped with `ionice -c2 -n7 nice -n 19` (see the verification sketch after this list).

4. **Drive replacement**: WD Green removed from service. NAS temporarily served from the USB drive (sdd). Samsung 870 EVO 2TB SSD ordered as replacement.
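
A sketch for applying and verifying items 1–3 above; the crontab line is illustrative only (the real schedule and script path may differ):

```bash
# 1. Pick up the systemd drop-in, then confirm the override values
systemctl daemon-reload
systemctl show ceph-osd@0.service -p StartLimitIntervalUSec -p StartLimitBurst -p RestartUSec

# 2. Confirm the per-OSD recovery throttles on osd.0
ceph config get osd.0 osd_recovery_max_active
ceph config get osd.0 osd_max_backfills
ceph config get osd.0 osd_recovery_sleep_ssd

# 3. Example crontab entry with lowered I/O and CPU priority (path is a placeholder)
# 0 3 * * * ionice -c2 -n7 nice -n 19 /usr/local/bin/nas-mirror-sync.sh
```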
### Post-SSD-Replacement

After the new SSD is installed and verified stable for one week:

```bash
# Consider relaxing recovery throttling
ceph config rm osd.0 osd_recovery_max_active
ceph config rm osd.0 osd_recovery_sleep_ssd

# Keep osd_max_backfills=1 as a permanent guardrail
```
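
During the one-week soak, something like the following can confirm osd.0 stays healthy before relaxing the throttles (a sketch, not an exhaustive checklist):

```bash
# Any new crash reports since the incident?
ceph crash ls

# Commit/apply latency per OSD; watch for osd.0 outliers
ceph osd perf

# OSD log over the soak window, filtered for trouble signs
journalctl -u ceph-osd@0 --since "7 days ago" | grep -i -E "timeout|slow|heartbeat"
```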
## HA Configuration

### HA-Managed Services
| Resource | Service | Node | Storage |
|---|---|---|---|
| ct:1112 | PostgreSQL (prod) | px1 | ceph-pool |
| ct:1113 | IoT Platform (prod) | px1 | ceph-pool |
| ct:1118 | ISP Monitor (STOPPED - migrated to Mint) | px1 | ceph-pool |
| ct:1935 | Pescle Rodent | px1 | ceph-pool |
| ct:1945 | Zoho Books API | px1 | ceph-pool |
| vm:1123 | CBRE People Counting | px1 | ceph-pool |
| ct:2912 | CT2912 | px2 | ceph-pool |
| ct:2913 | Difenn Sprint1 | px2 | ceph-pool |
| ct:2920 | Trevarn Core | px2 | ceph-pool |
| ct:2929 | Trevarn Brand | px2 | ceph-pool |
| ct:3102 | Homelab Monitor | px3 | ceph-pool |
| vm:3970 | RPA Autoparts Store | px3 | ceph-pool |
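
The live HA state can be compared against this table from any Proxmox node; a minimal sketch:

```bash
# Current state of every HA resource (should match the Node column above)
ha-manager status

# Full HA resource configuration
ha-manager config
```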
### Critical Notes

**CPU Compatibility for Live Migration:** px1 (12th Gen) and px3 (7th Gen) have different CPUs. VMs must use the `x86-64-v2-AES` CPU model for cross-node migration.
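
To spot VMs that would fail a px1/px3 migration, the configured CPU model can be checked per VM; a sketch (vmid 1123 is taken from the HA table above):

```bash
# Should print: cpu: x86-64-v2-AES
qm config 1123 | grep '^cpu:'

# Rough sweep over all VMs on the local node; prints "<default>" where no model is set
for vmid in $(qm list | awk 'NR>1 {print $1}'); do
  printf '%s: %s\n' "$vmid" "$(qm config "$vmid" | grep '^cpu:' || echo 'cpu: <default>')"
done
```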
## Recovery Tuning

```bash
# Speed up recovery (during maintenance windows only)
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

# Reset to defaults after recovery
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active

# Note: osd.0 has per-OSD overrides; these global settings
# will NOT affect osd.0 (its per-OSD config takes precedence)
```
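
Because of that precedence, it is worth checking what osd.0 is actually running with; a sketch:

```bash
# Effective values on the running osd.0 daemon (per-OSD override wins over global)
ceph config show osd.0 | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'
```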
## Monitoring

### Grafana Dashboard
URL: https://grafana.charliehub.net (Dashboard 2842)
### Prometheus Alerts
| Alert | Severity | Trigger |
|---|---|---|
| CephHealthWarning | warning | health_status == 1 for 5m |
| CephHealthError | critical | health_status == 2 for 1m |
| CephOSDDown | critical | Any OSD down for 1m |
| CephOSDNearFull | warning | OSD > 75% full |
| CephOSDFull | critical | OSD > 85% full |
| CephPGsDegraded | warning | Degraded PGs for 10m |
| CephMonitorDown | warning | Monitor count < 3 |
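
These alerts key off metrics from the ceph-mgr Prometheus module (exported as `ceph_health_status`, `ceph_osd_up`, and similar). As a sketch, the health metric can be queried directly; the Prometheus URL below is a placeholder, not confirmed for this setup:

```bash
# ceph_health_status: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
curl -s 'http://prometheus.example.local:9090/api/v1/query?query=ceph_health_status'

# OSD up/down as Prometheus sees it
curl -s 'http://prometheus.example.local:9090/api/v1/query?query=ceph_osd_up'
```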
## Quick Commands

```bash
# Cluster status
ceph -s

# OSD tree and balance
ceph osd tree
ceph osd df

# Detailed health
ceph health detail

# OSD performance (check latency)
ceph osd perf

# Check for heartbeat issues
ceph health detail | grep -i heartbeat

# Watch recovery progress
watch ceph -s
```
## Troubleshooting

### OSD Suicide Timeout

If osd.0 crashes with `ceph_abort_msg(hit suicide timeout)`:

1. Check whether backup jobs were running: `ps aux | grep -E "rsync|vzdump"`
2. The OSD should auto-restart (systemd policy fixes applied 2026-02-07).
3. Monitor recovery: `watch ceph -s`
4. If it repeats, check `journalctl -u ceph-osd@0 -f` for I/O stalls.
### OSD Down After Reboot

```bash
ceph osd tree                    # Check which OSDs are down
systemctl status ceph-osd@<id>
systemctl restart ceph-osd@<id>
```
### Migration Fails with CPU Error

Set the VM CPU to `x86-64-v2-AES`:

```bash
qm set <vmid> --cpu x86-64-v2-AES
qm shutdown <vmid> && qm start <vmid>
```
### Slow Recovery

```bash
ceph config get osd osd_max_backfills
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5
```
## Incident Log
| Date | Event | Impact | Resolution |
|---|---|---|---|
| 2026-01-19 | osd.1 (px2) crashed 3x | Temporary degradation | Auto-recovered |
| 2026-01-19 | osd.3 (px1) crashed | Temporary degradation | Auto-recovered |
| 2026-02-04 | osd.3 (px1) crashed | Temporary degradation | Auto-recovered |
| 2026-02-07 | osd.0 (px3) suicide timeout | 10.5h degradation (17% objects) | Drive removed, hardening applied |
## Migration History
| Date | Action |
|---|---|
| 2025-12-31 | Initialized Ceph cluster, created monitors |
| 2025-12-31 | Created OSD.0 on px3, OSD.1 on px1 |
| 2025-12-31 | Migrated all VMs to Ceph, configured HA |
| 2026-01-01 | Created OSD.2 on px1, upgraded pool to size=3 |
| 2026-01-xx | Created OSD.3 (px1 NVMe), OSD.4 (px2 NVMe) |
| 2026-02-07 | osd.0 incident — WD Green removed, hardening applied |
Last updated: 2026-02-07