# Monitoring
Monitoring infrastructure spans two locations: homelab-monitor (CT3102) for comprehensive homelab monitoring, and hub2 for OVH/cloud monitoring.
## Architecture Overview
```
┌─────────────────────────────────────┐
│ hub2 (OVH Dedicated)                │
│ - Prometheus (self + ISP monitor)   │
│ - Grafana                           │
│ - Node Exporter                     │
└─────────────────────────────────────┘
                   │
                   │ WireGuard VPN
                   │
┌────────────────────────────────────────────────────────────────┐
│ UK Homelab (10.44.x.x)                                         │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ homelab-monitor (CT3102) - px3-suzuka                      │ │
│ │ - Prometheus (Proxmox nodes, VMs, Ceph, PostgreSQL, Unifi) │ │
│ │ - Grafana                                                  │ │
│ │ - Loki + Promtail (log aggregation)                        │ │
│ │ - Alertmanager                                             │ │
│ │ - Unpoller (Unifi metrics)                                 │ │
│ │ - Pulse (Proxmox dashboard)                                │ │
│ │ - Homarr (service dashboard)                               │ │
│ └────────────────────────────────────────────────────────────┘ │
│                                                                │
│ Monitored targets:                                             │
│ - px1, px2, px3, px5 (Proxmox nodes)                           │
│ - CT1111, CT1112, CT1113 (production containers)               │
│ - Ceph cluster                                                 │
│ - PostgreSQL                                                   │
│ - Unifi network devices                                        │
└────────────────────────────────────────────────────────────────┘
```
## homelab-monitor (CT3102) - Primary Homelab Monitoring
| Property | Value |
|---|---|
| VMID | 3102 |
| Hostname | homelab-monitor |
| Location | px3-suzuka |
| IP | REDACTED_IP |
### Services
| Service | Container | Port | Purpose |
|---|---|---|---|
| Prometheus | prometheus | 9090 | Metrics collection |
| Grafana | grafana | 3000 | Dashboards |
| Loki | loki | 3100 | Log aggregation |
| Promtail | promtail | 9080 | Log shipping |
| Alertmanager | alertmanager | 9093 | Alert routing |
| Unpoller | unpoller | 9130 | Unifi metrics |
| Pulse | pulse | 7655 | Proxmox dashboard |
| Homarr | homarr | 7575 | Service dashboard |
### Configuration

```
/opt/monitoring/
├── docker-compose.yml
├── .env
├── prometheus/
│   └── prometheus.yml       # Scrape targets
├── grafana/
│   └── data/
├── loki/
│   └── loki-config.yml
├── promtail/
│   └── promtail-config.yml
├── alertmanager/
│   └── alertmanager.yml
├── pulse/
└── homarr/
```
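The services in the table above are defined in docker-compose.yml. As a hedged illustration (not the actual file), a minimal sketch of how the Prometheus and Grafana services could be declared; image tags, volume mounts, and the restart policy are assumptions, while the host ports come from the Services table:

```yaml
# Hypothetical abridged sketch of /opt/monitoring/docker-compose.yml;
# only two of the eight services are shown
services:
  prometheus:
    image: prom/prometheus:latest          # pinned tag unknown; assumption
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"                        # port from the Services table
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest          # pinned tag unknown; assumption
    container_name: grafana
    volumes:
      - ./grafana/data:/var/lib/grafana
    ports:
      - "3000:3000"                        # port from the Services table
    restart: unless-stopped
```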
### Prometheus Targets

CT3102 monitors the entire homelab infrastructure:

| Job | Targets | Description |
|---|---|---|
| `proxmox-nodes` | px1, px2, px3, px5 | Host metrics via node_exporter |
| `prod-containers` | CT1111, CT1112, CT1113 | Container metrics |
| `postgresql` | CT1112:9187 | PostgreSQL exporter |
| `ceph` | px1:9283 | Ceph cluster metrics |
| `unpoller` | localhost:9130 | Unifi network metrics |
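These jobs live in prometheus/prometheus.yml. A minimal sketch of two of them, assuming plain static_configs; the hostnames stand in for the redacted addresses:

```yaml
# Sketch of two scrape jobs from prometheus/prometheus.yml; hostnames are
# placeholders, and the use of static_configs is an assumption
scrape_configs:
  - job_name: proxmox-nodes
    static_configs:
      - targets:
          - "px1:9100"   # node_exporter on each Proxmox host
          - "px2:9100"
          - "px3:9100"
          - "px5:9100"

  - job_name: postgresql
    static_configs:
      - targets: ["CT1112:9187"]   # postgres exporter (port from table above)
```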
### Access

```bash
# SSH to CT3102
ssh root@REDACTED_IP   # px3
pct enter 3102

# Or direct (if network allows)
ssh root@REDACTED_IP
```
### Common Operations

```bash
# Check all containers
pct exec 3102 -- docker ps

# Restart monitoring stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# View Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# View logs
pct exec 3102 -- docker logs prometheus --tail 50
pct exec 3102 -- docker logs grafana --tail 50
```
## hub2 - Cloud/OVH Monitoring
hub2 runs a minimal Prometheus+Grafana setup for monitoring OVH infrastructure and itself.
| Property | Value |
|---|---|
| Location | hub2 (OVH Dedicated Server) |
| Prometheus Container | charliehub_prometheus |
| Grafana Container | charliehub_grafana |
| Config | /opt/charliehub/monitoring/ |
### Services
| Service | Port | URL | Purpose |
|---|---|---|---|
| Prometheus | 9090 | https://prometheus.charliehub.net | Metrics database |
| Grafana | 3000 | https://grafana.charliehub.net | Dashboards |
| Traefik | 8082 | (internal) | Reverse proxy metrics |
| Node Exporter | 9100 | (host) | Host metrics |
### Prometheus Targets (hub2)

hub2's Prometheus monitors all infrastructure via file-based service discovery:

| Job | Targets | Description |
|---|---|---|
| `hub2-node` | hub2 | hub2 host metrics (node_exporter) |
| `proxmox-uk` | px1, px2, px3 | UK Proxmox nodes (node_exporter) |
| `proxmox-fr` | px5 | France Proxmox node (node_exporter) |
| `isp-monitor` | ct1118 | ISP Monitor container |
| `hub2-apps` | traefik, grafana, alertmanager, cbre-pipeline | Docker application metrics |
| `exporters` | postgres (x3), redis (x2) | Database exporters |
| `prometheus` | localhost:9090 | Self-monitoring |
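With file-based service discovery, each job in prometheus.yml points at a YAML file under targets/ instead of hard-coding addresses. A hedged sketch of one such job; the refresh interval and in-container path are assumptions:

```yaml
# Sketch of a file_sd_configs job in hub2's prometheus.yml
scrape_configs:
  - job_name: proxmox-uk
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/proxmox-uk.yml   # path inside container; assumption
        refresh_interval: 5m                         # assumed value
```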
### Traefik Reverse Proxy Metrics
Traefik exposes metrics on the internal metrics endpoint (:8082/metrics) for Prometheus to scrape:
| Metric | Type | Description |
|---|---|---|
| `traefik_config_last_reload_success` | Gauge | Last reload success status (1 = success, 0 = failure) |
| `traefik_config_last_reload_success_timestamp_seconds` | Gauge | Timestamp of last successful reload |
| `traefik_entrypoint_requests_total` | Counter | Total HTTP requests by entrypoint, method, status code |
| `traefik_entrypoint_request_duration_seconds` | Histogram | HTTP request duration in seconds (by entrypoint, code) |
| `traefik_entrypoint_open_connections` | Gauge | Open connections by entrypoint |
Alert Rules:
- TraefikMetricsDown - Metrics endpoint unreachable (warning)
- TraefikHighErrorRate - 5xx error rate > 2% (critical)
- TraefikSlowResponseTime - P99 latency > 2 seconds (warning)
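As an illustration of how the 2% threshold could be expressed, a hedged sketch of the TraefikHighErrorRate rule; the exact PromQL and `for:` duration in traefik-alerts.yml may differ:

```yaml
groups:
  - name: traefik
    rules:
      - alert: TraefikHighErrorRate
        # 5xx share of all entrypoint requests over the last 5 minutes
        expr: |
          sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
            / sum(rate(traefik_entrypoint_requests_total[5m])) > 0.02
        for: 5m            # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Traefik 5xx error rate above 2%"
```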
Configuration:
- Metrics entrypoint: :8082 (internal Docker network)
- Scrape interval: 30s (reduced to minimize overhead)
- Service/router labels: disabled (to prevent high cardinality)
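The entrypoint and label settings map onto Traefik's static configuration roughly as below; the 30s scrape interval is set on the Prometheus side. A hedged sketch in file-provider syntax (hub2 may set the equivalent CLI flags instead):

```yaml
# Sketch of the relevant Traefik static configuration (traefik.yml)
entryPoints:
  metrics:
    address: ":8082"          # internal metrics entrypoint

metrics:
  prometheus:
    entryPoint: metrics
    addRoutersLabels: false   # router labels disabled to limit cardinality
    addServicesLabels: false  # service labels disabled to limit cardinality
```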
### Application Metrics

CBRE Pipeline exposes custom Prometheus metrics at /metrics:

| Metric | Type | Description |
|---|---|---|
| `cbre_http_requests_total` | Counter | Total HTTP requests by method, endpoint, status |
| `cbre_http_request_duration_seconds` | Histogram | Request latency in seconds |
| `cbre_uplinks_processed_total` | Counter | Total uplinks processed (pc/occ) |
| `cbre_aggregation_buckets_total` | Counter | Aggregation buckets created |
| `cbre_database_connected` | Gauge | Database connection status (1/0) |
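To eyeball these metrics, query them through Prometheus' HTTP API. A hedged example, run on hub2 and assuming Prometheus listens on localhost:9090:

```bash
# Request rate by status over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(cbre_http_requests_total[5m])) by (status)' \
  | jq '.data.result'

# Database connectivity gauge (1 = connected, 0 = down)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=cbre_database_connected' | jq '.data.result'
```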
### Database Exporters

hub2 runs exporters to monitor all PostgreSQL and Redis instances:
| Exporter | Target | Database |
|---|---|---|
| postgres-exporter-charliehub | charliehub-postgres | charliehub_domains |
| postgres-exporter-parking | parking-postgres | parking |
| postgres-exporter-cbre | cbre_postgres | cbre_pipeline |
| redis-exporter-authelia | authelia_redis | Authelia sessions |
| redis-exporter-parking | parking-redis | Parking cache |
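Each exporter is a small sidecar container pointed at its database. A hedged compose-style sketch of one of them; the image tag, credentials, and networking are assumptions:

```yaml
# Hypothetical sketch of the cbre postgres exporter service
services:
  postgres-exporter-cbre:
    image: prometheuscommunity/postgres-exporter:latest
    environment:
      # connection string is illustrative; real credentials live in .env
      DATA_SOURCE_NAME: "postgresql://exporter:CHANGEME@cbre_postgres:5432/cbre_pipeline?sslmode=disable"
    expose:
      - "9187"              # scraped over the internal Docker network
    restart: unless-stopped
```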
### Configuration Location

```
/opt/charliehub/monitoring/
├── prometheus/
│   ├── prometheus.yml        # Main config (uses file_sd_configs)
│   ├── targets/              # Service discovery targets
│   │   ├── hub2.yml          # hub2 node exporter
│   │   ├── hub2-apps.yml     # Traefik, Grafana, Alertmanager
│   │   ├── proxmox-uk.yml    # px1, px2, px3
│   │   ├── proxmox-fr.yml    # px5
│   │   ├── isp-monitor.yml   # CT1118
│   │   └── exporters.yml     # PostgreSQL + Redis exporters
│   ├── rules/                # Alert rules
│   │   ├── security-alerts.yml
│   │   ├── traefik-alerts.yml
│   │   └── ceph-alerts.yml
│   └── data/                 # Metrics storage
├── alertmanager/
│   └── alertmanager.yml      # Alert routing config
└── grafana/
    └── data/                 # Dashboards, settings
```
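Each file under targets/ uses Prometheus' file_sd format: a list of target groups with optional labels. A hedged sketch of proxmox-uk.yml; the addresses are placeholders for the redacted WireGuard IPs, and the label is an assumption:

```yaml
# Sketch of targets/proxmox-uk.yml (file_sd format)
- targets:
    - "10.44.x.x:9100"   # px1 (placeholder address)
    - "10.44.x.x:9100"   # px2
    - "10.44.x.x:9100"   # px3
  labels:
    site: uk-homelab     # assumed label
```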
### Common Operations (hub2)

```bash
# Check containers
docker ps | grep -E "prometheus|grafana"

# Restart monitoring
cd /opt/charliehub && docker compose restart prometheus grafana

# Check Prometheus health
curl http://localhost:9090/-/healthy

# Check Grafana health
curl http://localhost:3000/api/health

# View logs
docker logs charliehub_prometheus --tail 50
docker logs charliehub_grafana --tail 50
```
### Grafana Access (hub2)
| Property | Value |
|---|---|
| URL | https://grafana.charliehub.net |
| User | admin |
| Password | See /opt/charliehub/.env (GRAFANA_ADMIN_PASSWORD) |
## Comparison
| Feature | homelab-monitor (CT3102) | hub2 |
|---|---|---|
| Prometheus | Full homelab targets | All nodes + hub2 apps + databases |
| Grafana | Comprehensive dashboards | Basic dashboards |
| Alertmanager | Yes | Yes |
| PostgreSQL monitoring | CT1112 only | All 3 databases |
| Redis monitoring | No | Both Redis instances |
| Loki/Promtail | Yes | No |
| Unpoller | Yes (Unifi metrics) | No |
| Pulse | Yes (Proxmox dashboard) | No |
| Homarr | Yes (service dashboard) | No |
## Health Checks

### homelab-monitor (CT3102)

```bash
# From px3
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets | length'
```

### hub2

```bash
# Quick check
curl -s https://prometheus.charliehub.net/-/healthy
curl -s https://grafana.charliehub.net/api/health
```
## Alert System (hub2) - Phase 1

### Overview
Hub2 runs a comprehensive alert system that monitors the health of the monitoring infrastructure itself, detects failures, and sends notifications via email.
### Alert Capabilities

#### Monitoring Stack Health Alerts

19+ alert rules monitor:

- Prometheus: up/down, memory usage, disk usage, scraping health, config reload
- Alertmanager: health, email delivery success, memory usage
- Grafana: health, memory usage
- Traefik: metrics availability, error rates, response times
#### External Health Monitoring

px5 (France Proxmox) monitors hub2 every 5 minutes via external health checks:

- DNS resolution
- Network connectivity
- HTTPS port availability
- Prometheus health
- Grafana health
Results are logged to /var/log/hub2-healthcheck.json on px5.
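The check itself is a plain script run by cron on px5. The actual script isn't reproduced here; a minimal sketch of what an equivalent could look like, with the hostname as a placeholder and the JSON record shape assumed:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the px5 -> hub2 health check (cron: */5 * * * *)
host="hub2.example.net"   # placeholder for the real hub2 hostname
ts=$(date -Is)

dns=$(getent hosts "$host" >/dev/null && echo 1 || echo 0)
net=$(ping -c1 -W2 "$host" >/dev/null 2>&1 && echo 1 || echo 0)
https=$(curl -s --max-time 5 -o /dev/null "https://$host" && echo 1 || echo 0)
prom=$(curl -sf --max-time 5 https://prometheus.charliehub.net/-/healthy >/dev/null && echo 1 || echo 0)
graf=$(curl -sf --max-time 5 https://grafana.charliehub.net/api/health >/dev/null && echo 1 || echo 0)

# One JSON record per run, appended to the log px5 keeps
printf '{"time":"%s","dns":%s,"network":%s,"https":%s,"prometheus":%s,"grafana":%s}\n' \
  "$ts" "$dns" "$net" "$https" "$prom" "$graf" >> /var/log/hub2-healthcheck.json
```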
### Configuration

Alert Recipients:

- Primary: cpaumelle@eroundit.eu (immediate)
- Secondary: chpa35@gmail.com (after 5 min if primary unresponsive)

Email Delivery:

- SMTP: smtp.gmail.com:587 (Gmail app password)
- Encryption: TLS enabled
- Critical alerts: 10 second delay
- Warning alerts: 30 second delay
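In Alertmanager terms, the delays above correspond to group_wait on the severity routes. A hedged sketch of the relevant parts of alertmanager.yml; the sender address and receiver name are assumptions, and the 5-minute secondary escalation (likely a repeat route or a separate receiver) is omitted:

```yaml
# Sketch of /opt/charliehub/monitoring/alertmanager/alertmanager.yml (abridged)
global:
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "alerts@charliehub.net"      # placeholder sender
  smtp_auth_username: "alerts@charliehub.net"
  smtp_auth_password: "REDACTED"          # Gmail app password
  smtp_require_tls: true

route:
  receiver: email-primary
  routes:
    - match:
        severity: critical
      receiver: email-primary
      group_wait: 10s                     # "critical alerts: 10 second delay"
    - match:
        severity: warning
      receiver: email-primary
      group_wait: 30s                     # "warning alerts: 30 second delay"

receivers:
  - name: email-primary
    email_configs:
      - to: "cpaumelle@eroundit.eu"
```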
Alert Rules Files:
```
/opt/charliehub/monitoring/prometheus/rules/
├── monitoring-health-alerts.yml  # 19+ rules (Prometheus, Alertmanager, Grafana, Traefik)
├── traefik-alerts.yml            # 3 rules (error rate, latency, availability)
└── security-alerts.yml           # Meta-monitoring
```
### Alert Response

When an alert fires:

1. Prometheus evaluates the condition
2. Alertmanager receives the alert
3. Email sent to primary recipient
4. If alert persists 5+ minutes, secondary email sent
5. Recovery email sent when condition resolves
### Common Alerts You'll Receive
| Alert | When It Fires | Action |
|---|---|---|
| PrometheusDown | Prometheus unreachable | Check if container crashed; restart if needed |
| AlertmanagerNotificationsFailed | Email delivery fails | Check SMTP credentials; verify Gmail app password |
| TraefikHighErrorRate | 5xx errors > 2% | Investigate backend service logs |
| MonitoringStackDown | 2+ components offline | Critical outage - full investigation |
See the Alert Configuration Guide for the complete alert reference.
## Related Documentation
- Operations: Monitoring - Day-to-day monitoring tasks
- Operations: Alerting - Alert configuration and management
- Operations: Traefik Metrics - Traefik monitoring
- Reference: SMTP Configuration - Email setup
- hub2 Services - Central services hub
- WireGuard VPN - VPN connectivity to homelabs
- Network Layout - Infrastructure overview