Monitoring

Monitoring infrastructure spans two locations: homelab-monitor (CT3102) for comprehensive homelab monitoring, and hub2 for OVH/cloud monitoring.

Architecture Overview

                    ┌─────────────────────────────────────┐
                    │        hub2 (OVH Dedicated)         │
                    │  - Prometheus (self + ISP monitor)  │
                    │  - Grafana                          │
                    │  - Node Exporter                    │
                    └─────────────────────────────────────┘
                                      │
                                      │ WireGuard VPN
                                      │
┌─────────────────────────────────────────────────────────────────────────┐
│                    UK Homelab (10.44.x.x)                               │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │              homelab-monitor (CT3102) - px3-suzuka                │  │
│  │  - Prometheus (Proxmox nodes, VMs, Ceph, PostgreSQL, Unifi)       │  │
│  │  - Grafana                                                        │  │
│  │  - Loki + Promtail (log aggregation)                              │  │
│  │  - Alertmanager                                                   │  │
│  │  - Unpoller (Unifi metrics)                                       │  │
│  │  - Pulse (Proxmox dashboard)                                      │  │
│  │  - Homarr (service dashboard)                                     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  Monitored targets:                                                     │
│  - px1, px2, px3, px5 (Proxmox nodes)                                   │
│  - CT1111, CT1112, CT1113 (production containers)                       │
│  - Ceph cluster                                                         │
│  - PostgreSQL                                                           │
│  - Unifi network devices                                                │
└─────────────────────────────────────────────────────────────────────────┘

homelab-monitor (CT3102) - Primary Homelab Monitoring

| Property | Value |
|----------|-------|
| VMID | 3102 |
| Hostname | homelab-monitor |
| Location | px3-suzuka |
| IP | REDACTED_IP |

Services

| Service | Container | Port | Purpose |
|---------|-----------|------|---------|
| Prometheus | prometheus | 9090 | Metrics collection |
| Grafana | grafana | 3000 | Dashboards |
| Loki | loki | 3100 | Log aggregation |
| Promtail | promtail | 9080 | Log shipping |
| Alertmanager | alertmanager | 9093 | Alert routing |
| Unpoller | unpoller | 9130 | Unifi metrics |
| Pulse | pulse | 7655 | Proxmox dashboard |
| Homarr | homarr | 7575 | Service dashboard |

Configuration

/opt/monitoring/
├── docker-compose.yml
├── .env
├── prometheus/
│   └── prometheus.yml      # Scrape targets
├── grafana/
│   └── data/
├── loki/
│   └── loki-config.yml
├── promtail/
│   └── promtail-config.yml
├── alertmanager/
│   └── alertmanager.yml
├── pulse/
└── homarr/
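
The stack is defined by the docker-compose.yml and .env above. As a sketch of applying a config change (assuming the Docker Compose v2 CLI used in Common Operations below, and that prometheus.yml is bind-mounted into the prometheus container):

# Validate the compose file before touching anything
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose config --quiet"

# After editing prometheus/prometheus.yml, restart just that service to pick up the change
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart prometheus"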

Prometheus Targets

CT3102 monitors the entire homelab infrastructure:

| Job | Targets | Description |
|-----|---------|-------------|
| proxmox-nodes | px1, px2, px3, px5 | Host metrics via node_exporter |
| prod-containers | CT1111, CT1112, CT1113 | Container metrics |
| postgresql | CT1112:9187 | PostgreSQL exporter |
| ceph | px1:9283 | Ceph cluster metrics |
| unpoller | localhost:9130 | Unifi network metrics |
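
A quick way to confirm that none of these jobs has an unreachable target is to ask Prometheus for series where up == 0. A sketch, using the same CT3102 endpoint as the operations below (requires jq):

# List scrape targets that are currently down, as "job instance" pairs
curl -s -G "http://REDACTED_IP:9090/api/v1/query" \
  --data-urlencode 'query=up == 0' \
  | jq -r '.data.result[].metric | "\(.job) \(.instance)"'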

Access

# SSH to CT3102
ssh root@REDACTED_IP   # px3
pct enter 3102

# Or direct (if network allows)
ssh root@REDACTED_IP

Common Operations

# Check all containers
pct exec 3102 -- docker ps

# Restart monitoring stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# View Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# View logs
pct exec 3102 -- docker logs prometheus --tail 50
pct exec 3102 -- docker logs grafana --tail 50

hub2 - Cloud/OVH Monitoring

hub2 runs a minimal Prometheus+Grafana setup for monitoring OVH infrastructure and itself.

| Property | Value |
|----------|-------|
| Location | hub2 (OVH Dedicated Server) |
| Prometheus Container | charliehub_prometheus |
| Grafana Container | charliehub_grafana |
| Config | /opt/charliehub/monitoring/ |

Services

| Service | Port | URL | Purpose |
|---------|------|-----|---------|
| Prometheus | 9090 | https://prometheus.charliehub.net | Metrics database |
| Grafana | 3000 | https://grafana.charliehub.net | Dashboards |
| Traefik | 8082 | (internal) | Reverse proxy metrics |
| Node Exporter | 9100 | (host) | Host metrics |

Prometheus Targets (hub2)

hub2's Prometheus monitors all infrastructure via file-based service discovery (see the target-file sketch after the table):

| Job | Targets | Description |
|-----|---------|-------------|
| hub2-node | hub2 | Hub2 host metrics (node_exporter) |
| proxmox-uk | px1, px2, px3 | UK Proxmox nodes (node_exporter) |
| proxmox-fr | px5 | France Proxmox node (node_exporter) |
| isp-monitor | ct1118 | ISP Monitor container |
| hub2-apps | traefik, grafana, alertmanager, cbre-pipeline | Docker application metrics |
| exporters | postgres (x3), redis (x2) | Database exporters |
| prometheus | localhost:9090 | Self-monitoring |
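
Because discovery is file-based, adding a scrape target is an edit to the matching file under targets/, which Prometheus re-reads automatically without a restart. A minimal sketch (the hostname below is a hypothetical placeholder; match the layout and labels already used in the existing target files when editing for real):

# Hypothetical example: register an additional UK node under the proxmox-uk job
cat >> /opt/charliehub/monitoring/prometheus/targets/proxmox-uk.yml <<'EOF'
- targets:
    - 'px4.example.internal:9100'   # placeholder hostname; node_exporter default port
EOF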

Traefik Reverse Proxy Metrics

Traefik exposes metrics on the internal metrics endpoint (:8082/metrics) for Prometheus to scrape:

| Metric | Type | Description |
|--------|------|-------------|
| traefik_config_last_reload_success | Gauge | Last reload success status (1=success, 0=failure) |
| traefik_config_last_reload_success_timestamp_seconds | Gauge | Timestamp of last successful reload |
| traefik_entrypoint_requests_total | Counter | Total HTTP requests by entrypoint, method, status code |
| traefik_entrypoint_request_duration_seconds | Histogram | HTTP request duration in seconds (by entrypoint, code) |
| traefik_entrypoint_open_connections | Gauge | Open connections by entrypoint |

Alert Rules:

- TraefikMetricsDown - Metrics endpoint unreachable (warning)
- TraefikHighErrorRate - 5xx error rate > 2% (critical)
- TraefikSlowResponseTime - P99 latency > 2 seconds (warning)

Configuration:

- Metrics entrypoint: :8082 (internal Docker network)
- Scrape interval: 30s (reduced to minimize overhead)
- Service/router labels: disabled (to prevent high cardinality)
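
TraefikHighErrorRate compares the rate of 5xx responses against the total request rate. The ratio can be inspected manually with a sketch like the following (assuming the standard traefik_entrypoint_requests_total labels; the exact expression in traefik-alerts.yml may differ):

# 5xx share of traffic over the last 5 minutes, as a fraction (the 2% threshold corresponds to 0.02)
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_entrypoint_requests_total[5m]))' \
  | jq '.data.result[0].value[1]'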

Application Metrics

CBRE Pipeline exposes custom Prometheus metrics at /metrics:

| Metric | Type | Description |
|--------|------|-------------|
| cbre_http_requests_total | Counter | Total HTTP requests by method, endpoint, status |
| cbre_http_request_duration_seconds | Histogram | Request latency in seconds |
| cbre_uplinks_processed_total | Counter | Total uplinks processed (pc/occ) |
| cbre_aggregation_buckets_total | Counter | Aggregation buckets created |
| cbre_database_connected | Gauge | Database connection status (1/0) |
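
Since cbre-pipeline is scraped under the hub2-apps job, these metrics can be checked through Prometheus rather than by hitting the app directly. A sketch, assuming the query API on localhost:9090 as in the hub2 operations below:

# 1 means the pipeline currently has a database connection, 0 means it does not
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=cbre_database_connected' \
  | jq '.data.result[] | {instance: .metric.instance, connected: .value[1]}'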

Database Exporters

Hub2 runs exporters to monitor all PostgreSQL and Redis instances:

| Exporter | Target | Database |
|----------|--------|----------|
| postgres-exporter-charliehub | charliehub-postgres | charliehub_domains |
| postgres-exporter-parking | parking-postgres | parking |
| postgres-exporter-cbre | cbre_postgres | cbre_pipeline |
| redis-exporter-authelia | authelia_redis | Authelia sessions |
| redis-exporter-parking | parking-redis | Parking cache |
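
postgres_exporter and redis_exporter each publish a reachability gauge (pg_up and redis_up respectively), which gives a one-line health check across all five exporters. A sketch, assuming those standard metric names and the local query API:

# Expect a value of 1 per exporter; 0 means the exporter cannot reach its database
curl -s -G "http://localhost:9090/api/v1/query" \
  --data-urlencode 'query=pg_up or redis_up' \
  | jq '.data.result[] | {instance: .metric.instance, up: .value[1]}'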

Configuration Location

/opt/charliehub/monitoring/
├── prometheus/
│   ├── prometheus.yml          # Main config (uses file_sd_configs)
│   ├── targets/                # Service discovery targets
│   │   ├── hub2.yml           # Hub2 node exporter
│   │   ├── hub2-apps.yml      # Traefik, Grafana, Alertmanager
│   │   ├── proxmox-uk.yml     # px1, px2, px3
│   │   ├── proxmox-fr.yml     # px5
│   │   ├── isp-monitor.yml    # CT1118
│   │   └── exporters.yml      # PostgreSQL + Redis exporters
│   ├── rules/                  # Alert rules
│   │   ├── security-alerts.yml
│   │   ├── traefik-alerts.yml
│   │   └── ceph-alerts.yml
│   └── data/                   # Metrics storage
├── alertmanager/
│   └── alertmanager.yml        # Alert routing config
└── grafana/
    └── data/                   # Dashboards, settings
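
Before restarting Prometheus after editing anything under prometheus/, the config and rule files can be linted with promtool. A sketch, assuming promtool is available inside the charliehub_prometheus container and the config is mounted at the conventional /etc/prometheus path (adjust to the actual volume mounts):

# Validate the main config, including the referenced rule and target files
docker exec charliehub_prometheus promtool check config /etc/prometheus/prometheus.yml

# Validate a single rules file on its own
docker exec charliehub_prometheus promtool check rules /etc/prometheus/rules/traefik-alerts.yml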

Common Operations (hub2)

# Check containers
docker ps | grep -E "prometheus|grafana"

# Restart monitoring
cd /opt/charliehub && docker compose restart prometheus grafana

# Check Prometheus health
curl http://localhost:9090/-/healthy

# Check Grafana health
curl http://localhost:3000/api/health

# View logs
docker logs charliehub_prometheus --tail 50
docker logs charliehub_grafana --tail 50

Grafana Access (hub2)

| Property | Value |
|----------|-------|
| URL | https://grafana.charliehub.net |
| User | admin |
| Password | See /opt/charliehub/.env (GRAFANA_ADMIN_PASSWORD) |

Comparison

| Feature | homelab-monitor (CT3102) | hub2 |
|---------|--------------------------|------|
| Prometheus | Full homelab targets | All nodes + hub2 apps + databases |
| Grafana | Comprehensive dashboards | Basic dashboards |
| Alertmanager | Yes | Yes |
| PostgreSQL monitoring | CT1112 only | All 3 databases |
| Redis monitoring | No | Both Redis instances |
| Loki/Promtail | Yes | No |
| Unpoller | Yes (Unifi metrics) | No |
| Pulse | Yes (Proxmox dashboard) | No |
| Homarr | Yes (service dashboard) | No |

Health Checks

homelab-monitor (CT3102)

# From px3
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets | length'

hub2

# Quick check
curl -s https://prometheus.charliehub.net/-/healthy
curl -s https://grafana.charliehub.net/api/health

Alert System (hub2) - Phase 1

Overview

Hub2 runs a comprehensive alert system that monitors the health of the monitoring infrastructure itself, detects failures, and sends notifications via email.

Alert Capabilities

Monitoring Stack Health Alerts

19+ alert rules monitor:

- Prometheus: up/down, memory usage, disk usage, scraping health, config reload
- Alertmanager: health, email delivery success, memory usage
- Grafana: health, memory usage
- Traefik: metrics availability, error rates, response times
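
To see whether any of these rules are currently pending or firing, Prometheus exposes alert state over its HTTP API. A sketch run on hub2 (assuming the API is reachable on localhost:9090):

# List alerts that are currently pending or firing
curl -s http://localhost:9090/api/v1/alerts | \
  jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'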

External Health Monitoring

px5 (France Proxmox) monitors hub2 every 5 minutes via external health checks:

- DNS resolution
- Network connectivity
- HTTPS port availability
- Prometheus health
- Grafana health

Results are logged to /var/log/hub2-healthcheck.json on px5.
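
The health-check script itself is not reproduced here, but the checks above map to standard tools. A purely illustrative sketch of what the 5-minute cron job on px5 could look like (script name, probe targets, and JSON fields are hypothetical, not the actual implementation):

# Hypothetical hub2-healthcheck.sh, run from cron every 5 minutes on px5
host prometheus.charliehub.net > /dev/null 2>&1            && dns=ok   || dns=fail    # DNS resolution
ping -c 1 -W 2 prometheus.charliehub.net > /dev/null 2>&1  && net=ok   || net=fail    # network connectivity
nc -z -w 5 prometheus.charliehub.net 443 > /dev/null 2>&1  && https=ok || https=fail  # HTTPS port availability
curl -sf https://prometheus.charliehub.net/-/healthy > /dev/null && prom=ok || prom=fail
curl -sf https://grafana.charliehub.net/api/health   > /dev/null && graf=ok || graf=fail
echo "{\"ts\":\"$(date -Is)\",\"dns\":\"$dns\",\"net\":\"$net\",\"https\":\"$https\",\"prometheus\":\"$prom\",\"grafana\":\"$graf\"}" \
  >> /var/log/hub2-healthcheck.json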

Configuration

Alert Recipients:

- Primary: cpaumelle@eroundit.eu (immediate)
- Secondary: chpa35@gmail.com (after 5 min if primary unresponsive)

Email Delivery:

- SMTP: smtp.gmail.com:587 (Gmail app password)
- Encryption: TLS enabled
- Critical alerts: 10 second delay
- Warning alerts: 30 second delay

Alert Rules Files:

/opt/charliehub/monitoring/prometheus/rules/
├── monitoring-health-alerts.yml  # 19+ rules (Prometheus, Alertmanager, Grafana, Traefik)
├── traefik-alerts.yml            # 3 rules (error rate, latency, availability)
└── security-alerts.yml           # Meta-monitoring

Alert Response

When an alert fires:

1. Prometheus evaluates the condition
2. Alertmanager receives the alert
3. Email sent to primary recipient
4. If alert persists 5+ minutes, secondary email sent
5. Recovery email sent when condition resolves
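
During an incident this flow can be checked from hub2 itself. A sketch, assuming Alertmanager's v2 API is reachable on its default port 9093 (the container name in the second command is an assumption, not taken from this page):

# Alerts currently held by Alertmanager, i.e. alerts that have made it past step 2
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | {alert: .labels.alertname, since: .startsAt}'

# Sanity-check the routing/receiver config after edits (container name assumed)
docker exec charliehub_alertmanager amtool check-config /etc/alertmanager/alertmanager.yml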

Common Alerts You'll Receive

| Alert | When It Fires | Action |
|-------|---------------|--------|
| PrometheusDown | Prometheus unreachable | Check if container crashed; restart if needed |
| AlertmanagerNotificationsFailed | Email delivery fails | Check SMTP credentials; verify Gmail app password |
| TraefikHighErrorRate | 5xx errors > 2% | Investigate backend service logs |
| MonitoringStackDown | 2+ components offline | Critical outage - full investigation |

See Alert Configuration Guide for complete alert reference.