Ceph Storage Guide

This guide documents the Ceph cluster configuration, operational procedures, and hardening measures.

Current Status

Fully Operational — 5 OSDs, 3x Replication

All OSDs active, pool at size=3, HA configured, live migration working.

Component Status
OSD.0 (px3, Samsung 870 EVO 2TB)  ✅ Active
OSD.1 (px2, Samsung 870 EVO 2TB)  ✅ Active
OSD.2 (px1, Samsung 870 EVO 2TB)  ✅ Active
OSD.3 (px1, NVMe ~1.7TB)          ✅ Active
OSD.4 (px2, NVMe ~1.8TB)          ✅ Active
Pool replication                  3x (size=3, min_size=2)
Monitors                          3 (px1, px2, px3)
Managers                          px2 (active), px1 (standby)
All VMs on Ceph                   ✅ Complete
HA configured                     ✅ Complete
Live migration                    ✅ Tested working
Monitoring                        ✅ Grafana + Prometheus
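
To confirm the replication settings listed above, the pool parameters can be read back directly (ceph-pool is the pool used throughout this guide):

# Verify pool replication settings
ceph osd pool get ceph-pool size        # should report: size: 3
ceph osd pool get ceph-pool min_size    # should report: min_size: 2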

Architecture

px1-silverstone (i3-12100T, 8 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── Ceph OSD.3 on NVMe partition (~1.7TB)
├── sda: Ceph OSD.2 (Samsung 870 EVO 2TB, SATA)
└── sdb: backup-storage (USB)

px2-monza (i3-8100T, 4 cores, 32GB)
├── nvme0n1: OS/LVM (1.8TB NVMe)
│   └── Ceph OSD.4 on NVMe partition (~1.8TB)
└── sda: Ceph OSD.1 (Samsung 870 EVO 2TB, SATA)

px3-suzuka (i3-7100, 4 cores, 32GB)
├── sda: OS/LVM (120GB SATA SSD)
├── sdb: REMOVED (was WD Green HDD, SSD replacement on order)
├── sdc: Ceph OSD.0 (Samsung 870 EVO 2TB, SATA) — hardened
└── sdd: NAS storage (USB, temporary primary)

Ceph Cluster:
├── Monitors: px1, px2, px3 (quorum of 3)
├── Managers: px2 (active), px1 (standby)
├── OSDs: 5 active (2 on px1, 2 on px2, 1 on px3)
├── Pool: ceph-pool (size=3, min_size=2)
├── CRUSH rule: replicated_rule (chooseleaf_firstn, type host)
├── Raw capacity: 8.9 TiB
├── Used: ~1.5 TiB raw (17%)
└── Available: ~2.0 TiB usable (after 3x replication)
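
These capacity figures can be re-checked at any time with ceph df, which reports raw usage per device class; the per-pool MAX AVAIL column already accounts for 3x replication:

# Raw vs usable capacity, overall and per pool
ceph df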

OSD Distribution

OSD    Node             Device  Type                        Weight   PGs
osd.0  px3-suzuka       sdc     Samsung 870 EVO 2TB (SATA)  1.81940  65
osd.1  px2-monza        sda     Samsung 870 EVO 2TB (SATA)  1.70509  31
osd.2  px1-silverstone  sda     Samsung 870 EVO 2TB (SATA)  1.81940  34
osd.3  px1-silverstone  nvme    NVMe ~1.7TB                 1.70509  31
osd.4  px2-monza        nvme    NVMe ~1.8TB                 1.81940  34

OSD.0 Balance

osd.0 on px3 carries all 65 PGs because it is the only OSD on its host. CRUSH places one replica per host, so px3 holds a full copy of all data. px1 and px2 each split their replicas across 2 OSDs.
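
To see this split directly, ceph osd df tree groups OSDs under their CRUSH host buckets and shows the PG count per OSD, and ceph pg ls-by-osd lists the PGs mapped to a specific OSD:

# PGs and utilization grouped by CRUSH host (osd.0 should show ~65 PGs)
ceph osd df tree

# List the PGs currently mapped to osd.0
ceph pg ls-by-osd osd.0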


OSD.0 Hardening (px3)

Applied 2026-02-07 after a suicide timeout incident caused by I/O contention.

Root Cause

The WD20EADS Caviar Green HDD (SATA 1.5 Gb/s, ~15 years old) on px3 shared the SATA controller with the Ceph OSD SSD. Backup jobs (especially nas-mirror-sync.sh at 03:00) saturated the controller, starving osd.0 of I/O. The OSD hit a suicide timeout and self-terminated. systemd's default restart policy (3 attempts in 30 min) was exhausted, leaving osd.0 down for ~10.5 hours.

Fixes Applied

1. systemd Restart Policy (drop-in: /etc/systemd/system/ceph-osd@.service.d/restart-policy.conf):

[Service]
StartLimitIntervalSec=0
StartLimitBurst=0
RestartSec=20
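
After creating or editing the drop-in, reload systemd and confirm the override is in effect for the OSD unit:

# Pick up the drop-in and verify the new limits on ceph-osd@0
systemctl daemon-reload
systemctl show ceph-osd@0 -p StartLimitBurst -p StartLimitIntervalUSec -p RestartUSec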

2. Recovery Throttling (per-OSD, osd.0 only):

ceph config set osd.0 osd_recovery_max_active 1
ceph config set osd.0 osd_max_backfills 1
ceph config set osd.0 osd_recovery_sleep_ssd 0.1
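
To confirm the overrides are in place, check both the stored config and the value the running daemon is actually using:

# Stored per-OSD overrides
ceph config dump | grep 'osd\.0'

# Effective value on the running daemon
ceph config show osd.0 osd_recovery_max_active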

3. Backup I/O Priority — all NAS cron jobs wrapped with ionice -c2 -n7 nice -n 19
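
For illustration, a wrapped crontab entry looks like this; the script path below is an assumption, adjust it to wherever nas-mirror-sync.sh actually lives:

# 03:00 NAS mirror sync at lowest I/O and CPU priority
# (script path assumed for illustration; adjust to the real location)
0 3 * * * ionice -c2 -n7 nice -n 19 /usr/local/bin/nas-mirror-sync.sh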

4. Drive Replacement — WD Green removed from service. NAS temporarily served from USB drive (sdd). Samsung 870 EVO 2TB SSD ordered as replacement.

After SSD Replacement

After the new SSD is installed and verified stable for one week:

# Consider relaxing recovery throttling
ceph config rm osd.0 osd_recovery_max_active
ceph config rm osd.0 osd_recovery_sleep_ssd
# Keep osd_max_backfills=1 as a permanent guardrail

HA Configuration

HA-Managed Services

Resource  Service                                     Node  Storage
ct:1112   PostgreSQL (prod)                           px1   ceph-pool
ct:1113   IoT Platform (prod)                         px1   ceph-pool
ct:1118   ISP Monitor (STOPPED - migrated to Mint)    px1   ceph-pool
ct:1935   Pescle Rodent                               px1   ceph-pool
ct:1945   Zoho Books API                              px1   ceph-pool
vm:1123   CBRE People Counting                        px1   ceph-pool
ct:2912   CT2912                                      px2   ceph-pool
ct:2913   Difenn Sprint1                              px2   ceph-pool
ct:2920   Trevarn Core                                px2   ceph-pool
ct:2929   Trevarn Brand                               px2   ceph-pool
ct:3102   Homelab Monitor                             px3   ceph-pool
vm:3970   RPA Autoparts Store                         px3   ceph-pool
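
HA state can be reviewed from any cluster node with Proxmox's ha-manager tooling:

# Current HA resource states and CRM/LRM status
ha-manager status

# Configured HA resources (should match the table above)
ha-manager config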

Critical Notes

CPU Compatibility for Live Migration

px1 (12th Gen) and px3 (7th Gen) expose different CPU feature sets. VMs must use the x86-64-v2-AES CPU type for live migration between nodes.
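
Before migrating, it is worth confirming the configured CPU type (vmid is a placeholder):

# Show the CPU type configured for a VM
qm config <vmid> | grep '^cpu:'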


Recovery Tuning

# Speed up recovery (during maintenance windows only)
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

# Reset to defaults after recovery
ceph config rm osd osd_max_backfills
ceph config rm osd osd_recovery_max_active

# Note: osd.0 has per-OSD overrides — these global settings
# will NOT affect osd.0 (its per-OSD config takes precedence)
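
To see which value actually applies to a given OSD, query the config database at both scopes; the more specific section wins:

# Global (osd) value vs the per-daemon override on osd.0
ceph config get osd osd_max_backfills
ceph config get osd.0 osd_max_backfills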

Monitoring

Grafana Dashboard

URL: https://grafana.charliehub.net (Dashboard 2842)

Prometheus Alerts

Alert              Severity  Trigger
CephHealthWarning  warning   health_status == 1 for 5m
CephHealthError    critical  health_status == 2 for 1m
CephOSDDown        critical  Any OSD down for 1m
CephOSDNearFull    warning   OSD > 75% full
CephOSDFull        critical  OSD > 85% full
CephPGsDegraded    warning   Degraded PGs for 10m
CephMonitorDown    warning   Monitor count < 3

Quick Commands

# Cluster status
ceph -s

# OSD tree and balance
ceph osd tree
ceph osd df

# Detailed health
ceph health detail

# OSD performance (check latency)
ceph osd perf

# Check for heartbeat issues
ceph health detail | grep -i heartbeat

# Watch recovery progress
watch ceph -s

Troubleshooting

OSD Suicide Timeout

If osd.0 crashes with ceph_abort_msg("hit suicide timeout"):

  1. Check if backup jobs were running: ps aux | grep -E "rsync|vzdump"
  2. OSD should auto-restart (systemd policy fixes applied 2026-02-07)
  3. Monitor recovery: watch ceph -s
  4. If repeated: check journalctl -u ceph-osd@0 -f for I/O stalls

OSD Down After Reboot

ceph osd tree          # Check which OSDs are down
systemctl status ceph-osd@<id>
systemctl restart ceph-osd@<id>
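
If the unit keeps failing because the OSD was never activated after boot, re-activating it with ceph-volume usually brings it back; this assumes LVM-based OSDs, which is what Proxmox's pveceph tooling creates:

# Re-activate all ceph-volume LVM OSDs on this node
# (assumes ceph-volume LVM OSDs)
ceph-volume lvm activate --all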

Migration Fails with CPU Error

Set VM CPU to x86-64-v2-AES:

qm set <vmid> --cpu x86-64-v2-AES
qm shutdown <vmid> && qm start <vmid>
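
With the CPU type set, retry the live migration (vmid and target node are placeholders):

# Retry an online (live) migration
qm migrate <vmid> <target-node> --online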

Slow Recovery

ceph config get osd osd_max_backfills
ceph config set osd osd_max_backfills 3
ceph config set osd osd_recovery_max_active 5

Incident Log

Date        Event                        Impact                           Resolution
2026-01-19  osd.1 (px2) crashed 3x       Temporary degradation            Auto-recovered
2026-01-19  osd.3 (px1) crashed          Temporary degradation            Auto-recovered
2026-02-04  osd.3 (px1) crashed          Temporary degradation            Auto-recovered
2026-02-07  osd.0 (px3) suicide timeout  10.5h degradation (17% objects)  Drive removed, hardening applied

Migration History

Date Action
2025-12-31 Initialized Ceph cluster, created monitors
2025-12-31 Created OSD.0 on px3, OSD.1 on px1
2025-12-31 Migrated all VMs to Ceph, configured HA
2026-01-01 Created OSD.2 on px1, upgraded pool to size=3
2026-01-xx Created OSD.3 (px1 NVMe), OSD.4 (px2 NVMe)
2026-02-07 osd.0 incident — WD Green removed, hardening applied

Last updated: 2026-02-07