Monitoring Operations¶
Day-to-day monitoring operations for the CharlieHub infrastructure.
Monitoring Locations¶
| System | Location | Purpose |
|---|---|---|
| homelab-monitor (CT3102) | px3-suzuka | Primary homelab monitoring |
| hub2 | OVH Dedicated | Cloud/OVH monitoring |
See Monitoring Service for full architecture details.
Quick Access¶
homelab-monitor (CT3102)¶
| Service | URL/Port | Notes |
|---|---|---|
| Grafana | http://REDACTED_IP:3000 | Main dashboards |
| Prometheus | http://REDACTED_IP:9090 | Metrics queries |
| Alertmanager | http://REDACTED_IP:9093 | Alert management |
| Loki | http://REDACTED_IP:3100 | Log queries |
| Homarr | http://REDACTED_IP:7575 | Service dashboard |
| Pulse | http://REDACTED_IP:7655 | Proxmox dashboard |
hub1¶
| Service | URL | Notes |
|---|---|---|
| Grafana | https://grafana.charliehub.net | Basic dashboards |
| Prometheus | https://prometheus.charliehub.net | OVH metrics |
Daily Health Checks¶
Check homelab-monitor (CT3102)¶
```bash
# From px3
ssh root@REDACTED_IP

# Check all monitoring containers are running
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus target health
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check for any down targets
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health != "up")] | length'
```
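The other stack components expose their own health endpoints, so the same check extends to Loki and Alertmanager (standard `/ready` and `/-/healthy` endpoints, on the ports from the Quick Access table above):

```bash
# Loki readiness
pct exec 3102 -- curl -s http://localhost:3100/ready

# Alertmanager liveness
pct exec 3102 -- curl -s http://localhost:9093/-/healthy
```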
Check hub1¶
```bash
# On hub1
docker ps | grep -E "prometheus|grafana"

# Check Prometheus health
curl -s http://localhost:9090/-/healthy

# Check Grafana health
curl -s http://localhost:3000/api/health
```
Common Prometheus Queries¶
CPU Usage¶
```promql
# CPU usage by node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top 5 CPU consumers
topk(5, 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
```
Memory Usage¶
```promql
# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024
```
Disk Usage¶
```promql
# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

# Free disk space in GB
node_filesystem_free_bytes / 1024 / 1024 / 1024
```
Ceph Health (CT3102 only)¶
```promql
# Ceph cluster health status
ceph_health_status

# OSD usage
ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100
```
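Any of these expressions can also be run from a shell against the Prometheus HTTP API, which is handy when Grafana is unreachable. A sketch against the CT3102 instance, reusing the memory-usage query from above:

```bash
# Run a PromQL query via the HTTP API and print instance/value pairs
pct exec 3102 -- curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100' \
  | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```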
Log Queries (Loki - CT3102 only)¶
Access Loki via Grafana Explore or its HTTP API directly.

```logql
# Recent errors across all containers
{job="docker"} |= "error"

# Specific container logs
{container_name="prometheus"}

# Logs with JSON parsing
{job="docker"} | json | level="error"
```

LogQL has no `limit` stage; cap the number of returned lines with Grafana's line limit or the API's `limit` parameter.
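For the direct-API route, a sketch against Loki's standard `query_range` endpoint; the time range and limit are illustrative:

```bash
# Last hour of error lines via the Loki HTTP API
pct exec 3102 -- curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="docker"} |= "error"' \
  --data-urlencode 'limit=100' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  --data-urlencode "end=$(date +%s)000000000" \
  | jq -r '.data.result[].values[][1]'
```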
Troubleshooting¶
Prometheus Target Down¶
```bash
# Check if target host is reachable
ping <target_ip>

# Check if node_exporter is running on target
ssh root@<target_ip> systemctl status node_exporter

# Check firewall / listening port
ssh root@<target_ip> ss -tlnp | grep 9100
```
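If the port is listening but the target still shows as down, scrape the exporter directly from the monitoring container to confirm it is serving metrics (default node_exporter port assumed):

```bash
# Fetch metrics straight from the exporter, as Prometheus would
pct exec 3102 -- curl -s http://<target_ip>:9100/metrics | head -n 20
```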
Container Issues (CT3102)¶
```bash
# View container logs
pct exec 3102 -- docker logs prometheus --tail 100
pct exec 3102 -- docker logs grafana --tail 100
pct exec 3102 -- docker logs loki --tail 100

# Restart specific container
pct exec 3102 -- docker restart prometheus

# Restart entire stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# Force-recreate the containers from their current images
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose up -d --force-recreate"
```
Container Issues (hub1)¶
```bash
# View logs
docker logs charliehub_prometheus --tail 100
docker logs charliehub_grafana --tail 100

# Restart
cd /opt/charliehub && docker compose restart prometheus grafana
```
Disk Full on Prometheus¶
```bash
# Check Prometheus data directory size
# On CT3102
pct exec 3102 -- du -sh /opt/monitoring/prometheus/data/

# On hub1
du -sh /opt/charliehub/monitoring/prometheus/data/
```

If the directory has grown too large, tighten the retention flags (`--storage.tsdb.retention.time=15d`, `--storage.tsdb.retention.size=5GB`). These are Prometheus launch flags rather than `prometheus.yml` settings, so adjust them where the container's command is defined and recreate the container.
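A sketch of where these flags typically live, assuming the stack defines Prometheus's command in its compose file (service name and paths are illustrative, not copied from this deployment):

```yaml
# docker-compose.yml excerpt (illustrative)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --storage.tsdb.retention.size=5GB
```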
Alertmanager Not Sending Alerts (CT3102)¶
```bash
# Check Alertmanager status
pct exec 3102 -- curl -s http://localhost:9093/api/v1/status

# Check active alerts
pct exec 3102 -- curl -s http://localhost:9093/api/v1/alerts

# Note: Alertmanager 0.27+ removes the v1 API; use /api/v2/status and /api/v2/alerts there

# View Alertmanager logs
pct exec 3102 -- docker logs alertmanager --tail 50
```
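If alerts fire in Prometheus but never reach a receiver, amtool (bundled in the stock Alertmanager image) shows what Alertmanager has received and how it routes it. A sketch; the container name follows the commands above, and the config path assumes the stock image layout:

```bash
# List alerts as Alertmanager sees them
pct exec 3102 -- docker exec alertmanager amtool alert query --alertmanager.url=http://localhost:9093

# Print the routing tree from the loaded configuration (path assumes the stock image layout)
pct exec 3102 -- docker exec alertmanager amtool config routes --config.file=/etc/alertmanager/alertmanager.yml
```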
Adding New Monitoring Targets¶
Add target to CT3102¶
- Edit Prometheus config:

```bash
pct exec 3102 -- vi /opt/monitoring/prometheus/prometheus.yml
```

- Add the new target:

```yaml
- job_name: 'new-target'
  static_configs:
    - targets: ['10.44.1.xxx:9100']
      labels:
        instance: 'new-host'
```

- Reload Prometheus:

```bash
pct exec 3102 -- curl -X POST http://localhost:9090/-/reload
```
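Optionally, validate the edited config before reloading. promtool ships in the official Prometheus image; the in-container config path below is the conventional mount point and may differ in this stack:

```bash
# Validate the edited config inside the Prometheus container (path is the stock mount point)
pct exec 3102 -- docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
```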
Add target to hub1¶
- Edit Prometheus config:

```bash
vi /opt/charliehub/monitoring/prometheus/prometheus.yml
```

- Reload:

```bash
curl -X POST http://localhost:9090/-/reload
```
Grafana Dashboard Management¶
Import Dashboard (CT3102)¶
- Access Grafana at http://REDACTED_IP:3000
- Go to Dashboards > Import
- Enter dashboard ID from grafana.com or paste JSON
- Select Prometheus data source
Useful Dashboard IDs¶
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive host metrics |
| Docker Container | 893 | Container metrics |
| Loki Dashboard | 13639 | Log analysis |
| Ceph Cluster | 2842 | Ceph monitoring |
| UniFi-Poller | 11315 | UniFi network metrics |
Related Documentation¶
- Monitoring Service - Full architecture and configuration
- Troubleshooting - General troubleshooting guide
- Daily Tasks - Daily operations checklist
Ceph Alert Rules (hub2)¶
Prometheus alert rules for Ceph cluster health are deployed to /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml.
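For reference, a hedged sketch of how the first escalation tier could be expressed in that file. The alert name and duration follow the tables below, but the exact expressions in ceph-alerts.yml may differ; `ceph_healthcheck_slow_ops` (the slow-ops gauge from the Ceph mgr Prometheus module) is assumed here:

```yaml
# Illustrative only; see ceph-alerts.yml for the deployed rules
groups:
  - name: ceph-slow-ops
    rules:
      - alert: CephSlowOps
        expr: ceph_healthcheck_slow_ops > 0   # assumes the mgr prometheus module exposes this gauge
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph has been reporting slow ops for 5 minutes"
```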
Alert Escalation¶
| Alert | Duration | Severity | Action |
|---|---|---|---|
| CephSlowOps | 5 min | warning | Investigate immediately |
| CephSlowOpsEscalated | 15 min | high | Consider pausing backups/scrubs |
| CephSlowOpsCritical | 30 min | critical | Intervention required NOW |
Other Ceph Alerts¶
| Alert | Condition | Severity |
|---|---|---|
| CephOSDDown | Any OSD down for 1 min | critical |
| CephMonitorDown | Monitor out of quorum | critical |
| CephPGDegraded | PGs degraded for 5 min | warning |
| CephHealthWarn | Health WARN for 15 min | warning |
| CephOSDHighLatency | Apply latency > 100ms for 10 min | warning |
Managing Ceph Alerts¶
```bash
# View current alerts on hub2
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml exec prometheus wget -qO- http://localhost:9090/api/v1/alerts"

# Check rule file
ssh ubuntu@51.68.235.106 "cat /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml"

# Reload after changes
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml restart prometheus"
```
Responding to Ceph Slow Ops¶
If CephSlowOps fires:
- Check Ceph health:

```bash
ceph health detail
ceph osd perf
```

- Check for concurrent I/O:

```bash
# Backup jobs running?
ps aux | grep vzdump

# Scrubs running?
ceph pg dump | grep scrub
```

- Emergency mitigation:

```bash
# Pause scrubs
ceph osd set noscrub
ceph osd set nodeep-scrub

# After the issue is resolved
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```
Last updated: 2026-02-04