Monitoring Operations

Day-to-day monitoring operations for the CharlieHub infrastructure.

Monitoring Locations

| System | Location | Purpose |
| --- | --- | --- |
| homelab-monitor (CT3102) | px3-suzuka | Primary homelab monitoring |
| hub2 | OVH Dedicated | Cloud/OVH monitoring |

See Monitoring Service for full architecture details.

Quick Access

homelab-monitor (CT3102)

| Service | URL/Port | Notes |
| --- | --- | --- |
| Grafana | http://REDACTED_IP:3000 | Main dashboards |
| Prometheus | http://REDACTED_IP:9090 | Metrics queries |
| Alertmanager | http://REDACTED_IP:9093 | Alert management |
| Loki | http://REDACTED_IP:3100 | Log queries |
| Homarr | http://REDACTED_IP:7575 | Service dashboard |
| Pulse | http://REDACTED_IP:7655 | Proxmox dashboard |

hub2

| Service | URL | Notes |
| --- | --- | --- |
| Grafana | https://grafana.charliehub.net | Basic dashboards |
| Prometheus | https://prometheus.charliehub.net | OVH metrics |

Daily Health Checks

Check homelab-monitor (CT3102)

# From px3
ssh root@REDACTED_IP

# Check all monitoring containers are running
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus target health
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check for any down targets
pct exec 3102 -- curl -s http://localhost:9090/api/v1/targets | jq '[.data.activeTargets[] | select(.health != "up")] | length'

Check hub2

# On hub2
docker ps | grep -E "prometheus|grafana"

# Check Prometheus health
curl -s http://localhost:9090/-/healthy

# Check Grafana health
curl -s http://localhost:3000/api/health
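
These spot checks lend themselves to a small cron script. A minimal sketch, assuming the ports above and run locally on the host being checked (drop the Loki line on hub2):

#!/usr/bin/env bash
# daily-monitor-check.sh - quick pass/fail summary of local monitoring endpoints
set -u

check() {
  # $1 = label, $2 = health URL; -f makes curl fail on HTTP error codes
  if curl -fsS --max-time 5 "$2" > /dev/null; then
    echo "OK   $1"
  else
    echo "FAIL $1"
  fi
}

check Prometheus http://localhost:9090/-/healthy
check Grafana    http://localhost:3000/api/health
check Loki       http://localhost:3100/ready   # CT3102 only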

Common Prometheus Queries

CPU Usage

# CPU usage by node
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Top 5 CPU consumers
topk(5, 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))

Memory Usage

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

Disk Usage

# Disk usage percentage
(node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100

# Free disk space in GB
node_filesystem_free_bytes / 1024 / 1024 / 1024

Ceph Health (CT3102 only)

# Ceph cluster health status
ceph_health_status

# OSD usage
ceph_osd_stat_bytes_used / ceph_osd_stat_bytes * 100
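
Any of these queries can also be run outside Grafana against the Prometheus HTTP API with curl and jq (on CT3102, prefix with pct exec 3102 -- as in the health checks above):

# Instant query via the HTTP API
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=ceph_health_status' | jq '.data.result'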

Log Queries (Loki - CT3102 only)

Access Loki via Grafana Explore or the HTTP API directly. Note that LogQL has no limit stage; the number of returned lines is capped by Grafana's "Line limit" option or the API's limit parameter.

# Recent errors across all containers
{job="docker"} |= "error"

# Specific container logs
{container_name="prometheus"}

# Logs with JSON parsing
{job="docker"} | json | level="error"

Troubleshooting

Prometheus Target Down

# Check if target host is reachable
ping <target_ip>

# Check if node_exporter is running on target
ssh root@<target_ip> systemctl status node_exporter

# Check firewall
ssh root@<target_ip> ss -tlnp | grep 9100
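
If the service is up but the target still shows down, confirm the exporter actually answers from the monitoring host's side:

# Pull a sample of metrics from CT3102's perspective
pct exec 3102 -- curl -s http://<target_ip>:9100/metrics | head -n 5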

Container Issues (CT3102)

# View container logs
pct exec 3102 -- docker logs prometheus --tail 100
pct exec 3102 -- docker logs grafana --tail 100
pct exec 3102 -- docker logs loki --tail 100

# Restart specific container
pct exec 3102 -- docker restart prometheus

# Restart entire stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# Rebuild and restart
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose up -d --force-recreate"

Container Issues (hub2)

# View logs
docker logs charliehub_prometheus --tail 100
docker logs charliehub_grafana --tail 100

# Restart
cd /opt/charliehub && docker compose restart prometheus grafana

Disk Full on Prometheus

# Check Prometheus data directory size
# On CT3102
pct exec 3102 -- du -sh /opt/monitoring/prometheus/data/

# On hub2
du -sh /opt/charliehub/monitoring/prometheus/data/

# If too large, adjust the retention flags on the Prometheus startup
# command (these are CLI flags, not prometheus.yml settings):
# --storage.tsdb.retention.time=15d
# --storage.tsdb.retention.size=5GB
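
In this Docker setup the flags belong in the compose file's command list; a sketch of the relevant service block (image tag and in-container config path assumed):

# docker-compose.yml excerpt (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--storage.tsdb.retention.size=5GB'
      - '--web.enable-lifecycle'   # enables the /-/reload endpoint used elsewhere on this page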

Alertmanager Not Sending Alerts (CT3102)

# Check Alertmanager status (the v1 API was removed in Alertmanager 0.27; use v2)
pct exec 3102 -- curl -s http://localhost:9093/api/v2/status

# Check active alerts
pct exec 3102 -- curl -s http://localhost:9093/api/v2/alerts

# View Alertmanager logs
pct exec 3102 -- docker logs alertmanager --tail 50
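
To verify the notification pipeline end to end, inject a synthetic alert (the labels here are illustrative; pick ones that match your routing tree):

# Fire a test alert through the v2 API; it auto-resolves after resolve_timeout
pct exec 3102 -- curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RunbookTest","severity":"warning"},"annotations":{"summary":"Test alert from the runbook"}}]'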

Adding New Monitoring Targets

Add target to CT3102

  1. Edit Prometheus config:

    pct exec 3102 -- vi /opt/monitoring/prometheus/prometheus.yml
    

  2. Add new target:

    - job_name: 'new-target'
      static_configs:
        - targets: ['10.44.1.xxx:9100']
          labels:
            instance: 'new-host'
    

  3. Reload Prometheus (the endpoint requires the --web.enable-lifecycle flag):

    pct exec 3102 -- curl -X POST http://localhost:9090/-/reload
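To validate the config before reloading, promtool ships in the Prometheus image (container name and in-container config path assumed to match this stack):

# Check the config for syntax errors
pct exec 3102 -- docker exec prometheus promtool check config /etc/prometheus/prometheus.yml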

Add target to hub2

  1. Edit Prometheus config:

    vi /opt/charliehub/monitoring/prometheus/prometheus.yml
    

  2. Reload:

    curl -X POST http://localhost:9090/-/reload
    

Grafana Dashboard Management

Import Dashboard (CT3102)

  1. Access Grafana at http://REDACTED_IP:3000
  2. Go to Dashboards > Import
  3. Enter dashboard ID from grafana.com or paste JSON
  4. Select Prometheus data source

Useful Dashboard IDs

| Dashboard | ID | Description |
| --- | --- | --- |
| Node Exporter Full | 1860 | Comprehensive host metrics |
| Docker Container | 893 | Container metrics |
| Loki Dashboard | 13639 | Log analysis |
| Ceph Cluster | 2842 | Ceph monitoring |
| UniFi-Poller | 11315 | UniFi network metrics |

Ceph Alert Rules (hub2)

Prometheus alert rules for Ceph cluster health are deployed to /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml on hub2.
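
For orientation, the escalation tiers below correspond to rules shaped roughly like this sketch (the expression and metric name are illustrative, not copied from the deployed file):

# ceph-alerts.yml excerpt (illustrative sketch only)
groups:
  - name: ceph-slow-ops
    rules:
      - alert: CephSlowOps
        expr: ceph_healthcheck_slow_ops > 0   # assumes the mgr prometheus module's slow-ops healthcheck metric
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph reporting slow ops for 5+ minutes"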

Alert Escalation

| Alert | Duration | Severity | Action |
| --- | --- | --- | --- |
| CephSlowOps | 5 min | warning | Investigate immediately |
| CephSlowOpsEscalated | 15 min | high | Consider pausing backups/scrubs |
| CephSlowOpsCritical | 30 min | critical | Intervention required NOW |

Other Ceph Alerts

| Alert | Condition | Severity |
| --- | --- | --- |
| CephOSDDown | Any OSD down for 1 min | critical |
| CephMonitorDown | Monitor out of quorum | critical |
| CephPGDegraded | PGs degraded for 5 min | warning |
| CephHealthWarn | Health WARN for 15 min | warning |
| CephOSDHighLatency | Apply latency > 100 ms for 10 min | warning |

Managing Ceph Alerts

# View current alerts on hub2
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml exec prometheus wget -qO- http://localhost:9090/api/v1/alerts"

# Check rule file
ssh ubuntu@51.68.235.106 "cat /opt/charliehub/monitoring/prometheus/rules/ceph-alerts.yml"

# Reload after changes
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml restart prometheus"
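
The rule file can be validated before the restart with promtool, which ships in the Prometheus image (in-container path assumed to mirror the bind mount):

# Check rule syntax before restarting
ssh ubuntu@51.68.235.106 "sudo docker compose -f /opt/charliehub/docker-compose.yml exec prometheus promtool check rules /etc/prometheus/rules/ceph-alerts.yml"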

Responding to Ceph Slow Ops

If CephSlowOps fires:

  1. Check Ceph health:

    ceph health detail
    ceph osd perf
    

  2. Check for concurrent I/O:

    # Backup jobs running?
    ps aux | grep vzdump
    
    # Scrubs running?
    ceph pg dump | grep scrub
    

  3. Emergency mitigation:

    # Pause scrubs
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    
    # After issue resolved
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
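
    # Either way, confirm the flag state (the flags line lists noscrub/nodeep-scrub when set)
    ceph osd dump | grep flags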
    


Last updated: 2026-02-04