# Monitoring
Monitoring infrastructure spans two locations: homelab-monitor (CT3102) for comprehensive homelab monitoring, and hub2 for OVH/cloud monitoring.
## Architecture Overview
```
┌─────────────────────────────────────┐
│ hub2 (OVH Dedicated)                │
│ - Prometheus (self + ISP monitor)   │
│ - Grafana                           │
│ - Node Exporter                     │
└─────────────────────────────────────┘
                   │
                   │ WireGuard VPN
                   │
┌────────────────────────────────────────────────────────────────┐
│ UK Homelab (10.44.x.x)                                         │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ homelab-monitor (CT3102) - px3-suzuka                      │ │
│ │ - Prometheus (Proxmox nodes, VMs, Ceph, PostgreSQL, Unifi) │ │
│ │ - Grafana                                                  │ │
│ │ - Loki + Promtail (log aggregation)                        │ │
│ │ - Alertmanager                                             │ │
│ │ - Unpoller (Unifi metrics)                                 │ │
│ │ - Pulse (Proxmox dashboard)                                │ │
│ │ - Homarr (service dashboard)                               │ │
│ └────────────────────────────────────────────────────────────┘ │
│                                                                │
│ Monitored targets:                                             │
│ - px1, px2, px3, px5 (Proxmox nodes)                           │
│ - CT1111, CT1112, CT1113 (production containers)               │
│ - Ceph cluster                                                 │
│ - PostgreSQL                                                   │
│ - Unifi network devices                                        │
└────────────────────────────────────────────────────────────────┘
```
## homelab-monitor (CT3102) - Primary Homelab Monitoring
| Property | Value |
|---|---|
| VMID | 3102 |
| Hostname | homelab-monitor |
| Location | px3-suzuka |
| IP | REDACTED_IP |
### Services
| Service | Container | Port | Purpose |
|---|---|---|---|
| Prometheus | prometheus | 9090 | Metrics collection |
| Grafana | grafana | 3000 | Dashboards |
| Loki | loki | 3100 | Log aggregation |
| Promtail | promtail | 9080 | Log shipping |
| Alertmanager | alertmanager | 9093 | Alert routing |
| Unpoller | unpoller | 9130 | Unifi metrics |
| Pulse | pulse | 7655 | Proxmox dashboard |
| Homarr | homarr | 7575 | Service dashboard |
### Configuration

```
/opt/monitoring/
├── docker-compose.yml
├── .env
├── prometheus/
│   └── prometheus.yml       # Scrape targets
├── grafana/
│   └── data/
├── loki/
│   └── loki-config.yml
├── promtail/
│   └── promtail-config.yml
├── alertmanager/
│   └── alertmanager.yml
├── pulse/
└── homarr/
```
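The services in the table above are defined in docker-compose.yml. As a hedged illustration (not the actual file), a minimal sketch of how the Prometheus and Grafana services could be declared; image tags, volume mounts, and the restart policy are assumptions, while the host ports come from the Services table:

```yaml
# Hypothetical abridged sketch of /opt/monitoring/docker-compose.yml;
# only two of the eight services are shown
services:
  prometheus:
    image: prom/prometheus:latest          # pinned tag unknown; assumption
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"                        # port from the Services table
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest          # pinned tag unknown; assumption
    container_name: grafana
    volumes:
      - ./grafana/data:/var/lib/grafana
    ports:
      - "3000:3000"                        # port from the Services table
    restart: unless-stopped
```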
### Prometheus Targets

CT3102 monitors the entire homelab infrastructure:

| Job | Targets | Description |
|---|---|---|
| `proxmox-nodes` | px1, px2, px3, px5 | Host metrics via node_exporter |
| `prod-containers` | CT1111, CT1112, CT1113 | Container metrics |
| `postgresql` | CT1112:9187 | PostgreSQL exporter |
| `ceph` | px1:9283 | Ceph cluster metrics |
| `unpoller` | localhost:9130 | Unifi network metrics |
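These jobs live in prometheus/prometheus.yml. A minimal sketch of two of them, assuming plain static_configs; the hostnames stand in for the redacted addresses:

```yaml
# Sketch of two scrape jobs from prometheus/prometheus.yml; hostnames are
# placeholders, and the use of static_configs is an assumption
scrape_configs:
  - job_name: proxmox-nodes
    static_configs:
      - targets:
          - "px1:9100"   # node_exporter on each Proxmox host
          - "px2:9100"
          - "px3:9100"
          - "px5:9100"

  - job_name: postgresql
    static_configs:
      - targets: ["CT1112:9187"]   # postgres exporter (port from table above)
```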
### Access

```bash
# SSH to CT3102
ssh root@REDACTED_IP   # px3
pct enter 3102

# Or direct (if network allows)
ssh root@REDACTED_IP
```
### Common Operations

```bash
# Check all containers
pct exec 3102 -- docker ps

# Restart monitoring stack
pct exec 3102 -- bash -c "cd /opt/monitoring && docker compose restart"

# View Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# View logs
pct exec 3102 -- docker logs prometheus --tail 50
pct exec 3102 -- docker logs grafana --tail 50
```
## hub2 - Cloud/OVH Monitoring
hub2 runs a minimal Prometheus+Grafana setup for monitoring OVH infrastructure and itself.
| Property | Value |
|---|---|
| Location | hub2 (OVH Dedicated Server) |
| Prometheus Container | charliehub_prometheus |
| Grafana Container | charliehub_grafana |
| Config | /opt/charliehub/monitoring/ |
### Services
| Service | Port | URL | Purpose |
|---|---|---|---|
| Prometheus | 9090 | https://prometheus.charliehub.net | Metrics database |
| Grafana | 3000 | https://grafana.charliehub.net | Dashboards |
| Traefik | 8082 | (internal) | Reverse proxy metrics |
| Node Exporter | 9100 | (host) | Host metrics |
### Prometheus Targets (hub2)

hub2's Prometheus monitors all infrastructure via file-based service discovery:

| Job | Targets | Description |
|---|---|---|
| `hub2-node` | hub2 | hub2 host metrics (node_exporter) |
| `proxmox-uk` | px1, px2, px3 | UK Proxmox nodes (node_exporter) |
| `proxmox-fr` | px5 | France Proxmox node (node_exporter) |
| `isp-monitor` | ct1118 | ISP Monitor container |
| `hub2-apps` | traefik, grafana, alertmanager, cbre-pipeline | Docker application metrics |
| `exporters` | postgres (x3), redis (x2) | Database exporters |
| `prometheus` | localhost:9090 | Self-monitoring |
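With file-based service discovery, each job in prometheus.yml points at a YAML file under targets/ instead of hard-coding addresses. A hedged sketch of one such job; the refresh interval and in-container path are assumptions:

```yaml
# Sketch of a file_sd_configs job in hub2's prometheus.yml
scrape_configs:
  - job_name: proxmox-uk
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/proxmox-uk.yml   # path inside container; assumption
        refresh_interval: 5m                         # assumed value
```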
### Traefik Reverse Proxy Metrics
Traefik exposes metrics on the internal metrics endpoint (:8082/metrics) for Prometheus to scrape:
| Metric | Type | Description |
|---|---|---|
| `traefik_config_last_reload_success` | Gauge | Last reload success status (1 = success, 0 = failure) |
| `traefik_config_last_reload_success_timestamp_seconds` | Gauge | Timestamp of last successful reload |
| `traefik_entrypoint_requests_total` | Counter | Total HTTP requests by entrypoint, method, status code |
| `traefik_entrypoint_request_duration_seconds` | Histogram | HTTP request duration in seconds (by entrypoint, code) |
| `traefik_entrypoint_open_connections` | Gauge | Open connections by entrypoint |
Alert Rules:
- TraefikMetricsDown - Metrics endpoint unreachable (warning)
- TraefikHighErrorRate - 5xx error rate > 2% (critical)
- TraefikSlowResponseTime - P99 latency > 2 seconds (warning)
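As an illustration of how the 2% threshold could be expressed, a hedged sketch of the TraefikHighErrorRate rule; the exact PromQL and `for:` duration in traefik-alerts.yml may differ:

```yaml
groups:
  - name: traefik
    rules:
      - alert: TraefikHighErrorRate
        # 5xx share of all entrypoint requests over the last 5 minutes
        expr: |
          sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
            / sum(rate(traefik_entrypoint_requests_total[5m])) > 0.02
        for: 5m            # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Traefik 5xx error rate above 2%"
```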
Configuration:
- Metrics entrypoint: :8082 (internal Docker network)
- Scrape interval: 30s (reduced to minimize overhead)
- Service/router labels: disabled (to prevent high cardinality)
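The entrypoint and label settings map onto Traefik's static configuration roughly as below; the 30s scrape interval is set on the Prometheus side. A hedged sketch in file-provider syntax (hub2 may set the equivalent CLI flags instead):

```yaml
# Sketch of the relevant Traefik static configuration (traefik.yml)
entryPoints:
  metrics:
    address: ":8082"          # internal metrics entrypoint

metrics:
  prometheus:
    entryPoint: metrics
    addRoutersLabels: false   # router labels disabled to limit cardinality
    addServicesLabels: false  # service labels disabled to limit cardinality
```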
### Application Metrics

CBRE Pipeline exposes custom Prometheus metrics at /metrics:

| Metric | Type | Description |
|---|---|---|
| `cbre_http_requests_total` | Counter | Total HTTP requests by method, endpoint, status |
| `cbre_http_request_duration_seconds` | Histogram | Request latency in seconds |
| `cbre_uplinks_processed_total` | Counter | Total uplinks processed (pc/occ) |
| `cbre_aggregation_buckets_total` | Counter | Aggregation buckets created |
| `cbre_database_connected` | Gauge | Database connection status (1/0) |
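To eyeball these metrics, query them through Prometheus' HTTP API. A hedged example, run on hub2 and assuming Prometheus listens on localhost:9090:

```bash
# Request rate by status over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(cbre_http_requests_total[5m])) by (status)' \
  | jq '.data.result'

# Database connectivity gauge (1 = connected, 0 = down)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=cbre_database_connected' | jq '.data.result'
```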
### Database Exporters

hub2 runs exporters to monitor all PostgreSQL and Redis instances:
| Exporter | Target | Database |
|---|---|---|
| postgres-exporter-charliehub | charliehub-postgres | charliehub_domains |
| postgres-exporter-parking | parking-postgres | parking |
| postgres-exporter-cbre | cbre_postgres | cbre_pipeline |
| redis-exporter-authelia | authelia_redis | Authelia sessions |
| redis-exporter-parking | parking-redis | Parking cache |
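Each exporter is a small sidecar container pointed at its database. A hedged compose-style sketch of one of them; the image tag, credentials, and networking are assumptions:

```yaml
# Hypothetical sketch of the cbre postgres exporter service
services:
  postgres-exporter-cbre:
    image: prometheuscommunity/postgres-exporter:latest
    environment:
      # connection string is illustrative; real credentials live in .env
      DATA_SOURCE_NAME: "postgresql://exporter:CHANGEME@cbre_postgres:5432/cbre_pipeline?sslmode=disable"
    expose:
      - "9187"              # scraped over the internal Docker network
    restart: unless-stopped
```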
### Configuration Location

```
/opt/charliehub/monitoring/
├── prometheus/
│   ├── prometheus.yml        # Main config (uses file_sd_configs)
│   ├── targets/              # Service discovery targets
│   │   ├── hub2.yml          # hub2 node exporter
│   │   ├── hub2-apps.yml     # Traefik, Grafana, Alertmanager
│   │   ├── proxmox-uk.yml    # px1, px2, px3
│   │   ├── proxmox-fr.yml    # px5
│   │   ├── isp-monitor.yml   # CT1118
│   │   └── exporters.yml     # PostgreSQL + Redis exporters
│   ├── rules/                # Alert rules
│   │   ├── security-alerts.yml
│   │   ├── traefik-alerts.yml
│   │   └── ceph-alerts.yml
│   └── data/                 # Metrics storage
├── alertmanager/
│   └── alertmanager.yml      # Alert routing config
└── grafana/
    └── data/                 # Dashboards, settings
```
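Each file under targets/ uses Prometheus' file_sd format: a list of target groups with optional labels. A hedged sketch of proxmox-uk.yml; the addresses are placeholders for the redacted WireGuard IPs, and the label is an assumption:

```yaml
# Sketch of targets/proxmox-uk.yml (file_sd format)
- targets:
    - "10.44.x.x:9100"   # px1 (placeholder address)
    - "10.44.x.x:9100"   # px2
    - "10.44.x.x:9100"   # px3
  labels:
    site: uk-homelab     # assumed label
```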
### Common Operations (hub2)

```bash
# Check containers
docker ps | grep -E "prometheus|grafana"

# Restart monitoring
cd /opt/charliehub && docker compose restart prometheus grafana

# Check Prometheus health
curl http://localhost:9090/-/healthy

# Check Grafana health
curl http://localhost:3000/api/health

# View logs
docker logs charliehub_prometheus --tail 50
docker logs charliehub_grafana --tail 50
```
### Grafana Access (hub2)
| Property | Value |
|---|---|
| URL | https://grafana.charliehub.net |
| User | admin |
| Password | See /opt/charliehub/.env (GRAFANA_ADMIN_PASSWORD) |
## Comparison
| Feature | homelab-monitor (CT3102) | hub2 |
|---|---|---|
| Prometheus | Full homelab targets | All nodes + hub2 apps + databases |
| Grafana | Comprehensive dashboards | Basic dashboards |
| Alertmanager | Yes | Yes |
| PostgreSQL monitoring | CT1112 only | All 3 databases |
| Redis monitoring | No | Both Redis instances |
| Loki/Promtail | Yes | No |
| Unpoller | Yes (Unifi metrics) | No |
| Pulse | Yes (Proxmox dashboard) | No |
| Homarr | Yes (service dashboard) | No |
## Health Checks

### homelab-monitor (CT3102)

```bash
# From px3
pct exec 3102 -- docker ps --format "table {{.Names}}\t{{.Status}}"

# Check Prometheus targets
curl -s http://REDACTED_IP:9090/api/v1/targets | jq '.data.activeTargets | length'
```

### hub2

```bash
# Quick check
curl -s https://prometheus.charliehub.net/-/healthy
curl -s https://grafana.charliehub.net/api/health
```
## Alert System (hub2) - Phase 1

### Overview
Hub2 runs a comprehensive alert system that monitors the health of the monitoring infrastructure itself, detects failures, and sends notifications via email.
### Alert Capabilities

#### Monitoring Stack Health Alerts

19+ alert rules monitor:

- Prometheus: up/down, memory usage, disk usage, scraping health, config reload
- Alertmanager: health, email delivery success, memory usage
- Grafana: health, memory usage
- Traefik: metrics availability, error rates, response times
#### External Health Monitoring

px5 (France Proxmox) monitors hub2 every 5 minutes via external health checks:

- DNS resolution
- Network connectivity
- HTTPS port availability
- Prometheus health
- Grafana health
Results are logged to /var/log/hub2-healthcheck.json on px5.
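The check itself is a plain script run by cron on px5. The actual script isn't reproduced here; a minimal sketch of what an equivalent could look like, with the hostname as a placeholder and the JSON record shape assumed:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the px5 -> hub2 health check (cron: */5 * * * *)
host="hub2.example.net"   # placeholder for the real hub2 hostname
ts=$(date -Is)

dns=$(getent hosts "$host" >/dev/null && echo 1 || echo 0)
net=$(ping -c1 -W2 "$host" >/dev/null 2>&1 && echo 1 || echo 0)
https=$(curl -s --max-time 5 -o /dev/null "https://$host" && echo 1 || echo 0)
prom=$(curl -sf --max-time 5 https://prometheus.charliehub.net/-/healthy >/dev/null && echo 1 || echo 0)
graf=$(curl -sf --max-time 5 https://grafana.charliehub.net/api/health >/dev/null && echo 1 || echo 0)

# One JSON record per run, appended to the log px5 keeps
printf '{"time":"%s","dns":%s,"network":%s,"https":%s,"prometheus":%s,"grafana":%s}\n' \
  "$ts" "$dns" "$net" "$https" "$prom" "$graf" >> /var/log/hub2-healthcheck.json
```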
### Configuration

Alert Recipients:

- Primary: cpaumelle@eroundit.eu (immediate)
- Secondary: chpa35@gmail.com (after 5 min if primary unresponsive)

Email Delivery:

- SMTP: smtp.gmail.com:587 (Gmail app password)
- Encryption: TLS enabled
- Critical alerts: 10 second delay
- Warning alerts: 30 second delay
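In Alertmanager terms, the delays above correspond to group_wait on the severity routes. A hedged sketch of the relevant parts of alertmanager.yml; the sender address and receiver name are assumptions, and the 5-minute secondary escalation (likely a repeat route or a separate receiver) is omitted:

```yaml
# Sketch of /opt/charliehub/monitoring/alertmanager/alertmanager.yml (abridged)
global:
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "alerts@charliehub.net"      # placeholder sender
  smtp_auth_username: "alerts@charliehub.net"
  smtp_auth_password: "REDACTED"          # Gmail app password
  smtp_require_tls: true

route:
  receiver: email-primary
  routes:
    - match:
        severity: critical
      receiver: email-primary
      group_wait: 10s                     # "critical alerts: 10 second delay"
    - match:
        severity: warning
      receiver: email-primary
      group_wait: 30s                     # "warning alerts: 30 second delay"

receivers:
  - name: email-primary
    email_configs:
      - to: "cpaumelle@eroundit.eu"
```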
Alert Rules Files:
```
/opt/charliehub/monitoring/prometheus/rules/
├── monitoring-health-alerts.yml  # 19+ rules (Prometheus, Alertmanager, Grafana, Traefik)
├── traefik-alerts.yml            # 3 rules (error rate, latency, availability)
└── security-alerts.yml           # Meta-monitoring
```
### Alert Response

When an alert fires:

1. Prometheus evaluates the condition
2. Alertmanager receives the alert
3. Email sent to primary recipient
4. If alert persists 5+ minutes, secondary email sent
5. Recovery email sent when condition resolves
### Common Alerts You'll Receive
| Alert | When It Fires | Action |
|---|---|---|
| PrometheusDown | Prometheus unreachable | Check if container crashed; restart if needed |
| AlertmanagerNotificationsFailed | Email delivery fails | Check SMTP credentials; verify Gmail app password |
| TraefikHighErrorRate | 5xx errors > 2% | Investigate backend service logs |
| MonitoringStackDown | 2+ components offline | Critical outage - full investigation |
See the Alert Configuration Guide for the complete alert reference.
## Related Documentation
- Operations: Monitoring - Day-to-day monitoring tasks
- Operations: Alerting - Alert configuration and management
- Operations: Traefik Metrics - Traefik monitoring
- Reference: SMTP Configuration - Email setup
- hub2 Services - Central services hub
- WireGuard VPN - VPN connectivity to homelabs
- Network Layout - Infrastructure overview