Ceph Storage Guide

This guide documents the Ceph cluster configuration, operational procedures, and hardening measures.

Current Status

Fully Operational — 5 OSDs, 3x Replication

All OSDs active, pool at size=3, HA configured, live migration working.

Component Status
OSD.0 (px3, Samsung 870 EVO 2TB)  ✅ Active
OSD.1 (px2, Samsung 870 EVO 2TB)  ✅ Active
OSD.2 (px1, Samsung 870 EVO 2TB)  ✅ Active
OSD.3 (px1, NVMe ~1.7TB)          ✅ Active
OSD.4 (px2, NVMe ~1.8TB)          ✅ Active
Pool replication                  3x (size=3, min_size=2)
Monitors                          3 (px1, px2, px3)
Managers                          px2 (active), px1 (standby)
All VMs on Ceph                   ✅ Complete
HA configured                     ✅ Complete
Live migration                    ✅ Tested working
Monitoring                        ✅ Grafana + Prometheus
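
To confirm the replication settings listed above, the pool parameters can be read back directly (ceph-pool is the pool used throughout this guide):

# Verify pool replication settings
ceph osd pool get ceph-pool size        # should report: size: 3
ceph osd pool get ceph-pool min_size    # should report: min_size: 2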

Architecture

px1-silverstone (i3-12100T, 8 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── Ceph OSD.3 on NVMe partition (~1.7TB)
├── sda: Ceph OSD.2 (Samsung 870 EVO 2TB, SATA)
└── sdb: backup-storage (USB)

px2-monza (i3-8100T, 4 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── Ceph OSD.4 on NVMe partition (~1.8TB)
└── sda: Ceph OSD.1 (Samsung 870 EVO 2TB, SATA)

px3-suzuka (i3-7100, 4 cores, 32GB)
├── sda: OS/LVM (120GB SATA SSD)
├── sdb: REMOVED (was WD Green HDD, SSD replacement on order)
├── sdc: Ceph OSD.0 (Samsung 870 EVO 2TB, SATA) — hardened
└── sdd: NAS storage (USB, temporary primary)

Ceph Cluster:
├── Monitors: px1, px2, px3 (quorum of 3)
├── Managers: px2 (active), px1 (standby)
├── OSDs: 5 active (2 on px1, 2 on px2, 1 on px3)
├── Pool: ceph-pool (size=3, min_size=2)
├── CRUSH rule: replicated_rule (chooseleaf_firstn, type host)
├── Raw capacity: 8.9 TiB
├── Used: ~1.5 TiB raw (17%)
└── Available: ~2.0 TiB usable (after 3x replication)
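
These capacity figures can be re-checked at any time with ceph df, which reports raw usage per device class; the per-pool MAX AVAIL column already accounts for 3x replication:

# Raw vs usable capacity, overall and per pool
ceph df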

OSD Distribution

OSD    Node             Device  Type                        Weight   PGs
osd.0  px3-suzuka       sdc     Samsung 870 EVO 2TB (SATA)  1.81940  65
osd.1  px2-monza        sda     Samsung 870 EVO 2TB (SATA)  1.70509  31
osd.2  px1-silverstone  sda     Samsung 870 EVO 2TB (SATA)  1.81940  34
osd.3  px1-silverstone  nvme    NVMe ~1.7TB                 1.70509  31
osd.4  px2-monza        nvme    NVMe ~1.8TB                 1.81940  34

OSD.0 Balance

osd.0 on px3 carries all 65 PGs because it is the only OSD on its host. CRUSH places one replica per host, so px3 holds a full copy of all data. px1 and px2 each split their replicas across 2 OSDs.
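
To see this split directly, ceph osd df tree groups OSDs under their CRUSH host buckets and shows the PG count per OSD, and ceph pg ls-by-osd lists the PGs mapped to a specific OSD:

# PGs and utilization grouped by CRUSH host (osd.0 should show ~65 PGs)
ceph osd df tree

# List the PGs currently mapped to osd.0
ceph pg ls-by-osd osd.0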


OSD.0 Hardening (px3)

Applied 2026-02-07 after a suicide timeout incident caused by I/O contention.

Root Cause

The WD20EADS Caviar Green HDD (SATA 1.5 Gb/s, ~15 years old) on px3 shared the SATA controller with the Ceph OSD SSD. Backup jobs (especially nas-mirror-sync.sh at 03:00) saturated the controller, starving osd.0 of I/O. The OSD hit a suicide timeout and self-terminated. systemd's default restart policy (3 attempts in 30 min) was exhausted, leaving osd.0 down for ~10.5 hours.

Fixes Applied

1. systemd Restart Policy (drop-in: /etc/systemd/system/ceph-osd@.service.d/restart-policy.conf):

[Service]
StartLimitIntervalSec=0
StartLimitBurst=0
RestartSec=20
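
After creating or editing the drop-in, reload systemd and confirm the override is in effect for the OSD unit:

# Pick up the drop-in and verify the new limits on ceph-osd@0
systemctl daemon-reload
systemctl show ceph-osd@0 -p StartLimitBurst -p StartLimitIntervalUSec -p RestartUSec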

2. Recovery Throttling (per-OSD, osd.0 only):

ceph config set osd.0 osd_recovery_max_active 1
ceph config set osd.0 osd_max_backfills 1
ceph config set osd.0 osd_recovery_sleep_ssd 0.1
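
To confirm the overrides are in place, check both the stored config and the value the running daemon is actually using:

# Stored per-OSD overrides
ceph config dump | grep 'osd\.0'

# Effective value on the running daemon
ceph config show osd.0 osd_recovery_max_active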

3. Backup I/O Priority — all NAS cron jobs wrapped with ionice -c2 -n7 nice -n 19
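
For illustration, a wrapped crontab entry looks like this; the script path below is an assumption, adjust it to wherever nas-mirror-sync.sh actually lives:

# 03:00 NAS mirror sync at lowest I/O and CPU priority
# (script path assumed for illustration; adjust to the real location)
0 3 * * * ionice -c2 -n7 nice -n 19 /usr/local/bin/nas-mirror-sync.sh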

4. Drive Replacement — WD Green removed from service. NAS temporarily served from USB drive (sdd). Samsung 870 EVO 2TB SSD ordered as replacement.

After SSD Replacement

After the new SSD is installed and verified stable for one week:

# Consider relaxing recovery throttling
ceph config rm osd.0 osd_recovery_max_active
ceph config rm osd.0 osd_recovery_sleep_ssd
# Keep osd_max_backfills=1 as a permanent guardrail

HA Configuration

HA-Managed Services

Resource  Service                                     Node  Storage
ct:1112   PostgreSQL (prod)                           px1   ceph-pool
ct:1113   IoT Platform (prod)                         px1   ceph-pool
ct:1118   ISP Monitor (STOPPED - migrated to Mint)    px1   ceph-pool
ct:1935   Pescle Rodent                               px1   ceph-pool
ct:1945   Zoho Books API                              px1   ceph-pool
vm:1123   CBRE People Counting                        px1   ceph-pool
ct:2912   CT2912                                      px2   ceph-pool
ct:2913   Difenn Sprint1                              px2   ceph-pool
ct:2920   Trevarn Core                                px2   ceph-pool
ct:2929   Trevarn Brand                               px2   ceph-pool
ct:3102   Homelab Monitor                             px3   ceph-pool
vm:3970   RPA Autoparts Store                         px3   ceph-pool
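
HA state can be reviewed from any cluster node with Proxmox's ha-manager tooling:

# Current HA resource states and CRM/LRM status
ha-manager status

# Configured HA resources (should match the table above)
ha-manager config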

Critical Notes

CPU Compatibility for Live Migration

px1 (12th Gen) and px3 (7th Gen) expose different CPU feature sets. VMs must use the x86-64-v2-AES CPU type for live migration between nodes.
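
Before migrating, it is worth confirming the configured CPU type (vmid is a placeholder):

# Show the CPU type configured for a VM
qm config <vmid> | grep '^cpu:'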


Recovery Tuning

# Speed up recovery (during maintenance windows only)
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

# Reset to defaults after recovery
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active

# Note: osd.0 has per-OSD overrides — these global settings
# will NOT affect osd.0 (its per-OSD config takes precedence)
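
To see which value actually applies to a given OSD, query the config database at both scopes; the more specific section wins:

# Global (osd) value vs the per-daemon override on osd.0
ceph config get osd osd_max_backfills
ceph config get osd.0 osd_max_backfills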

Monitoring

Grafana Dashboard

URL: https://grafana.charliehub.net (Dashboard 2842)

Prometheus Alerts

Alert              Severity  Trigger
CephHealthWarning  warning   health_status == 1 for 5m
CephHealthError    critical  health_status == 2 for 1m
CephOSDDown        critical  Any OSD down for 1m
CephOSDNearFull    warning   OSD > 75% full
CephOSDFull        critical  OSD > 85% full
CephPGsDegraded    warning   Degraded PGs for 10m
CephMonitorDown    warning   Monitor count < 3

Quick Commands

# Cluster status
ceph -s

# OSD tree and balance
ceph osd tree
ceph osd df

# Detailed health
ceph health detail

# OSD performance (check latency)
ceph osd perf

# Check for heartbeat issues
ceph health detail | grep -i heartbeat

# Watch recovery progress
watch ceph -s

Troubleshooting

OSD Suicide Timeout

If osd.0 crashes with ceph_abort_msg("hit suicide timeout"):

  1. Check if backup jobs were running: ps aux | grep -E "rsync|vzdump"
  2. OSD should auto-restart (systemd policy fixes applied 2026-02-07)
  3. Monitor recovery: watch ceph -s
  4. If repeated: check journalctl -u ceph-osd@0 -f for I/O stalls

OSD Down After Reboot

ceph osd tree          # Check which OSDs are down
systemctl status ceph-osd@<id>
systemctl restart ceph-osd@<id>
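
If the unit keeps failing because the OSD was never activated after boot, re-activating it with ceph-volume usually brings it back; this assumes LVM-based OSDs, which is what Proxmox's pveceph tooling creates:

# Re-activate all ceph-volume LVM OSDs on this node
# (assumes ceph-volume LVM OSDs)
ceph-volume lvm activate --all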

Migration Fails with CPU Error

Set VM CPU to x86-64-v2-AES:

qm set <vmid> --cpu x86-64-v2-AES
qm shutdown <vmid> && qm start <vmid>
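
With the CPU type set, retry the live migration (vmid and target node are placeholders):

# Retry an online (live) migration
qm migrate <vmid> <target-node> --online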

Slow Recovery

ceph config get osd osd_max_backfills
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

Incident Log

Date        Event                        Impact                           Resolution
2026-01-19  osd.1 (px2) crashed 3x       Temporary degradation            Auto-recovered
2026-01-19  osd.3 (px1) crashed          Temporary degradation            Auto-recovered
2026-02-04  osd.3 (px1) crashed          Temporary degradation            Auto-recovered
2026-02-07  osd.0 (px3) suicide timeout  10.5h degradation (17% objects)  Drive removed, hardening applied

Migration History

Date Action
2025-12-31 Initialized Ceph cluster, created monitors
2025-12-31 Created OSD.0 on px3, OSD.1 on px1
2025-12-31 Migrated all VMs to Ceph, configured HA
2026-01-01 Created OSD.2 on px1, upgraded pool to size=3
2026-01-xx Created OSD.3 (px1 NVMe), OSD.4 (px2 NVMe)
2026-02-07 osd.0 incident — WD Green removed, hardening applied

Last updated: 2026-02-07