Monitoring Operations

Day-to-day monitoring operations for the CharlieHub infrastructure.

Monitoring Locations

| System | Location | Purpose |
| --- | --- | --- |
| homelab-monitor (CT3102) | px3-suzuka | Primary homelab monitoring |
| hub2 | OVH Dedicated | Cloud/OVH monitoring |

See Monitoring Service for full architecture details.

Quick Access

homelab-monitor (CT3102)

| Service | URL/Port | Notes |
| --- | --- | --- |
| Grafana | http://REDACTED_IP:3000 | Main dashboards |
| Prometheus | http://REDACTED_IP:9090 | Metrics queries |
| Alertmanager | http://REDACTED_IP:9093 | Alert management |
| Loki | http://REDACTED_IP:3100 | Log queries |
| Homarr | http://REDACTED_IP:7575 | Service dashboard |
| Pulse | http://REDACTED_IP:7655 | Proxmox dashboard |

hub2

| Service | URL | Notes |
| --- | --- | --- |
| Grafana | https://grafana.charliehub.net | Basic dashboards |
| Prometheus | https://prometheus.charliehub.net | OVH metrics |

Daily Health Checks

Check homelab-monitor (CT3102)

# From px3
ssh root@REDACTED_IP

# Check all monitoring containers are running
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus target health
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check for any down targets
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health != "up")] | length'

Check hub2

# On hub2
docker ps | grep -E "prometheus|grafana"

# Check Prometheus health
curl -s http://localhost:9090/-/healthy

# Check Grafana health
curl -s http://localhost:3000/api/health
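
These spot checks lend themselves to a small cron script. A minimal sketch, assuming the ports above and run locally on the host being checked (drop the Loki line on hub2):

#!/usr/bin/env bash
# daily-monitor-check.sh - quick pass/fail summary of local monitoring endpoints
set -u

check() {
  # $1 = label, $2 = health URL; -f makes curl fail on HTTP error codes
  if curl -fsS --max-time 5 "$2" > /dev/null; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}

check Prometheus http://localhost:9090/-/healthy
check Grafana    http://localhost:3000/api/health
check Loki       http://localhost:3100/ready   # CT3102 only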

Common Prometheus Queries

CPU Usage

# CPU usage by node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top 5 CPU consumers
topk(5, 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

Memory Usage

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

Disk Usage

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

# Free disk space in GB
node_filesystem_free_bytes / 1024 / 1024 / 1024

Ceph Health (CT3102 only)

# Ceph cluster health status
ceph_health_status

# OSD usage
ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100
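
Any of these queries can also be run outside Grafana against the Prometheus HTTP API with curl and jq (on CT3102, prefix with pct exec 3102 -- as in the health checks above):

# Instant query via the HTTP API
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=ceph_health_status' | jq '.data.result'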

Log Queries (Loki - CT3102 only)

Access Loki via Grafana Explore or the HTTP API directly. Note that LogQL has no limit stage; the number of returned lines is capped by Grafana's "Line limit" option or the API's limit parameter.

# Recent errors across all containers
{job="docker"} |= "error"

# Specific container logs
{container_name="prometheus"}

# Logs with JSON parsing
{job="docker"} | json | level="error"

Troubleshooting

Prometheus Target Down

# Check if target host is reachable
ping <target_ip>

# Check if node_exporter is running on target
ssh root@<target_ip> systemctl status node_exporter

# Check firewall
ssh root@<target_ip> ss -tlnp | grep 9100
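
If the service is up but the target still shows down, confirm the exporter actually answers from the monitoring host's side:

# Pull a sample of metrics from CT3102's perspective
pct exec 3102 -- curl -s http://<target_ip>:9100/metrics | head -n 5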

Container Issues (CT3102)

# View container logs
pct exec 3102 -- docker logs prometheus --tail 100
pct exec 3102 -- docker logs grafana --tail 100
pct exec 3102 -- docker logs loki --tail 100

# Restart specific container
pct exec 3102 -- docker restart prometheus

# Restart entire stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# Rebuild and restart
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose up -d --force-recreate"

Container Issues (hub2)

# View logs
docker logs charliehub_prometheus --tail 100
docker logs charliehub_grafana --tail 100

# Restart
cd /opt/charliehub && docker compose restart prometheus grafana

Disk Full on Prometheus

# Check Prometheus data directory size
# On CT3102
pct exec 3102 -- du -sh /opt/monitoring/prometheus/data/

# On hub2
du -sh /opt/charliehub/monitoring/prometheus/data/

# If too large, adjust the retention flags on the Prometheus startup
# command (these are CLI flags, not prometheus.yml settings):
# --storage.tsdb.retention.time=15d
# --storage.tsdb.retention.size=5GB
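
In this Docker setup the flags belong in the compose file's command list; a sketch of the relevant service block (image tag and in-container config path assumed):

# docker-compose.yml excerpt (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=5GB'
      - '--web.enable-lifecycle'   # enables the /-/reload endpoint used elsewhere on this page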

Alertmanager Not Sending Alerts (CT3102)

# Check Alertmanager status (the v1 API was removed in Alertmanager 0.27; use v2)
pct exec 3102 -- curl -s http://localhost:9093/api/v2/status

# Check active alerts
pct exec 3102 -- curl -s http://localhost:9093/api/v2/alerts

# View Alertmanager logs
pct exec 3102 -- docker logs alertmanager --tail 50
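
To verify the notification pipeline end to end, inject a synthetic alert (the labels here are illustrative; pick ones that match your routing tree):

# Fire a test alert through the v2 API; it auto-resolves after resolve_timeout
pct exec 3102 -- curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RunbookTest","severity":"warning"},"annotations":{"summary":"Test alert from the runbook"}}]'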

Adding New Monitoring Targets

Add target to CT3102

  1. Edit Prometheus config:

    pct exec 3102 -- vi /opt/monitoring/prometheus/prometheus.yml
    

  2. Add new target:

    - job_name: 'new-target'
      static_configs:
        - targets: ['10.44.1.xxx:9100']
          labels:
            instance: 'new-host'
    

  3. Reload Prometheus (the endpoint requires the --web.enable-lifecycle flag):

    pct exec 3102 -- curl -X POST http://localhost:9090/-/reload
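To validate the config before reloading, promtool ships in the Prometheus image (container name and in-container config path assumed to match this stack):

# Check the config for syntax errors
pct exec 3102 -- docker exec prometheus promtool check config /etc/prometheus/prometheus.yml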

Add target to hub2

  1. Edit Prometheus config:

    vi /opt/charliehub/monitoring/prometheus/prometheus.yml
    

  2. Reload:

    curl -X POST http://localhost:9090/-/reload
    

Grafana Dashboard Management

Import Dashboard (CT3102)

  1. Access Grafana at http://REDACTED_IP:3000
  2. Go to Dashboards > Import
  3. Enter dashboard ID from grafana.com or paste JSON
  4. Select Prometheus data source

Useful Dashboard IDs

| Dashboard | ID | Description |
| --- | --- | --- |
| Node Exporter Full | 1860 | Comprehensive host metrics |
| Docker Container | 893 | Container metrics |
| Loki Dashboard | 13639 | Log analysis |
| Ceph Cluster | 2842 | Ceph monitoring |
| UniFi-Poller | 11315 | UniFi network metrics |

Ceph Alert Rules (hub2)

Prometheus alert rules for Ceph cluster health are deployed to /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml on hub2.
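
For orientation, the escalation tiers below correspond to rules shaped roughly like this sketch (the expression and metric name are illustrative, not copied from the deployed file):

# ceph-alerts.yml excerpt (illustrative sketch only)
groups:
  - name: ceph-slow-ops
    rules:
      - alert: CephSlowOps
        expr: ceph_healthcheck_slow_ops > 0   # assumes the mgr prometheus module's slow-ops healthcheck metric
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph reporting slow ops for 5+ minutes"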

Alert Escalation

| Alert | Duration | Severity | Action |
| --- | --- | --- | --- |
| CephSlowOps | 5 min | warning | Investigate immediately |
| CephSlowOpsEscalated | 15 min | high | Consider pausing backups/scrubs |
| CephSlowOpsCritical | 30 min | critical | Intervention required NOW |

Other Ceph Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| CephOSDDown | Any OSD down for 1 min | critical |
| CephMonitorDown | Monitor out of quorum | critical |
| CephPGDegraded | PGs degraded for 5 min | warning |
| CephHealthWarn | Health WARN for 15 min | warning |
| CephOSDHighLatency | Apply latency > 100 ms for 10 min | warning |

Managing Ceph Alerts

# View current alerts on hub2
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml exec prometheus wget -qO- http://localhost:9090/api/v1/alerts"

# Check rule file
ssh ubuntu@51.68.235.106 "cat /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml"

# Reload after changes
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml restart prometheus"
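
The rule file can be validated before the restart with promtool, which ships in the Prometheus image (in-container path assumed to mirror the bind mount):

# Check rule syntax before restarting
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml exec prometheus promtool check rules /etc/prometheus/rules/ceph-alerts.yml"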

Responding to Ceph Slow Ops

If CephSlowOps fires:

  1. Check Ceph health:

    ceph health detail
    ceph osd perf
    

  2. Check for concurrent I/O:

    # Backup jobs running?
    ps aux | grep vzdump
    
    # Scrubs running?
    ceph pg dump | grep scrub
    

  3. Emergency mitigation:

    # Pause scrubs
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    
    # After issue resolved
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
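
    # Either way, confirm the flag state (the flags line lists noscrub/nodeep-scrub when set)
    ceph osd dump | grep flags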
    


Last updated: 2026-02-04