Alert Configuration and Management¶
Comprehensive guide to configuring, managing, and troubleshooting alerts in the CharlieHub monitoring system.
Quick Start¶
Alert Recipients¶
| Role | Email | Purpose |
|---|---|---|
| Primary | cpaumelle@eroundit.eu | Main alert destination |
| Secondary | chpa35@gmail.com | Fallback after 5 minutes if primary unresponsive |
Configured in: /opt/charliehub/.env
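To confirm what is currently configured, you can read the recipient variables straight from the environment file:
# List the alert recipient variables set in /opt/charliehub/.env
grep '^ALERT_EMAIL' /opt/charliehub/.env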
Alert Status¶
# Check if Alertmanager is healthy
docker compose ps alertmanager
# View active alerts
docker exec charliehub_alertmanager wget -qO- http://localhost:9093/api/v1/alerts | jq '.data[]'
# Check Prometheus alerts firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
Alert Configuration¶
Email Recipients Configuration¶
File: /opt/charliehub/.env
# Primary alert recipient
ALERT_EMAIL=cpaumelle@eroundit.eu
# Secondary recipient (fallback after 5 min)
ALERT_EMAIL_SECONDARY=chpa35@gmail.com
# Critical-only recipient (for escalation)
ALERT_EMAIL_CRITICAL_ONLY=cpaumelle@eroundit.eu
SMTP Configuration¶
File: /opt/charliehub/.env
# Gmail SMTP settings
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=alerts@charliehub.net
SMTP_PASSWORD=<gmail-app-password> # 16-char password
SMTP_FROM=cpaumelle@eroundit.eu # Sender address
SMTP_REQUIRE_TLS=true # Enforce TLS encryption
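To check that these SMTP settings are usable from hub2 (port reachable, STARTTLS offered), a quick manual probe with openssl works; this is a sketch and assumes openssl is installed on the host:
# Probe Gmail's SMTP relay: expect a STARTTLS handshake and a valid certificate chain
openssl s_client -starttls smtp -connect smtp.gmail.com:587 -brief </dev/null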
File: /opt/charliehub/monitoring/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'cpaumelle@eroundit.eu'
  smtp_auth_username: 'alerts@charliehub.net'
  smtp_auth_password: '<gmail-app-password>'
  smtp_require_tls: true
⚠️ Technical Debt: Email addresses are currently hardcoded in alertmanager.yml. This requires manual synchronization with .env values. See TECHNICAL-DEBT.md for refactoring plan.
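Until that refactor lands, one way to keep the two files in sync is to render alertmanager.yml from a template with envsubst. This is only a sketch: the .template file does not exist yet, and it assumes the .env values are shell-safe (quoted, no inline comments).
# Hypothetical sketch: render alertmanager.yml from a template using the values in .env
# (alertmanager.yml.template is not part of the current setup - it would contain ${ALERT_EMAIL}, ${SMTP_HOST}, etc.)
set -a; . /opt/charliehub/.env; set +a
envsubst < /opt/charliehub/monitoring/alertmanager/alertmanager.yml.template \
  > /opt/charliehub/monitoring/alertmanager/alertmanager.yml
docker compose restart alertmanager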
Alert Rules¶
Overview¶
Prometheus evaluates alert rules continuously and sends firing alerts to Alertmanager. Alertmanager routes them to configured recipients.
Alert Rules Location: /opt/charliehub/monitoring/prometheus/rules/
| Rule Group | File | Description | Rule Count |
|---|---|---|---|
| `monitoring_health` | `monitoring-health-alerts.yml` | Monitoring stack health checks | 19 |
| `traefik_health` | `traefik-alerts.yml` | Traefik reverse proxy metrics | 3 |
| `security` | `security-alerts.yml` | Meta-monitoring and security | Various |
Monitoring Stack Health Alerts¶
File: /opt/charliehub/monitoring/prometheus/rules/monitoring-health-alerts.yml
These 19+ alert rules monitor the health of Prometheus, Alertmanager, Grafana, and Traefik themselves.
Prometheus Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `PrometheusDown` | Prometheus unreachable | 3 min | CRITICAL | Immediate investigation required |
| `PrometheusHighMemoryUsage` | Prometheus > 1 GB RAM | 5 min | WARNING | Check for memory leak; restart if needed |
| `PrometheusHighDiskUsage` | Prometheus disk > 90% | 10 min | WARNING | Clean up old metrics; increase retention size |
| `PrometheusNotScraping` | No new metrics in 5 min | 5 min | WARNING | Check target health; verify scrape config |
| `PrometheusTargetDown` | Scrape target unreachable | 2 min | WARNING | Check target connectivity/firewall |
| `PrometheusConfigReloadFailed` | Config reload fails | 1 min | WARNING | Validate YAML syntax; check file permissions |
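To see which scrape targets would currently satisfy a down-target alert such as PrometheusTargetDown, you can evaluate the underlying expression by hand (the rule's exact PromQL may differ from this up == 0 sketch):
# List scrape targets that are currently down
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up == 0' | jq '.data.result[].metric'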
Alertmanager Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `AlertmanagerDown` | Alertmanager unreachable | 2 min | CRITICAL | Alerts will not be routed - immediate fix required |
| `AlertmanagerHighMemoryUsage` | Alertmanager > 512 MB | 5 min | WARNING | Check for memory leak; restart if needed |
| `AlertmanagerNotificationsFailed` | Email delivery failures | 5 min | CRITICAL | Check SMTP credentials; verify recipient email |
Grafana Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `GrafanaDown` | Grafana unreachable | 2 min | WARNING | Restart container; check logs |
| `GrafanaHighMemoryUsage` | Grafana > 1 GB RAM | 5 min | WARNING | Restart or optimize dashboards |
Traefik Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `TraefikMetricsDown` | Metrics endpoint unreachable | 2 min | WARNING | Restart Traefik; check Docker network |
| `TraefikHighErrorRate` | 5xx error rate > 2% | 5 min | CRITICAL | Investigate backend services; check logs |
| `TraefikSlowResponseTime` | P99 latency > 2 sec | 5 min | WARNING | Profile slow endpoints; check database queries |
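To check the current 5xx ratio behind TraefikHighErrorRate, the standard Traefik entrypoint metrics can be queried directly (metric names assume Traefik's built-in Prometheus instrumentation; the rule's exact expression may differ):
# Approximate the 5xx error ratio over the last 5 minutes
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_entrypoint_requests_total[5m]))' \
  | jq '.data.result[0].value[1]'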
Overall Stack Health¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `MonitoringStackDown` | 2+ monitoring components down | 5 min | CRITICAL | Major outage - full investigation required |
| `MonitoringMetricCardinalityHigh` | Time series > 500k | 10 min | WARNING | Reduce labels in scrape configs; consider downsampling |
Traefik Metrics Alerts¶
File: /opt/charliehub/monitoring/prometheus/rules/traefik-alerts.yml
Alerts for the Traefik reverse proxy, monitoring request rates, error rates, and latency.
Traefik Alert Rules¶
TraefikMetricsDown¶
- Condition: Traefik metrics endpoint unreachable
- Duration: 2 minutes
- Severity: WARNING
- Action:
docker compose restart traefik
docker compose logs traefik | tail -50
TraefikHighErrorRate¶
- Condition: 5xx error rate > 2% (last 5 minutes)
- Severity: CRITICAL
- Action:
# Check Traefik logs for backend errors
docker compose logs traefik | grep -i "error\|5[0-9][0-9]" | tail -20
# Check backend services health
curl -s https://grafana.charliehub.net/api/health
curl -s https://prometheus.charliehub.net/-/healthy
TraefikSlowResponseTime¶
- Condition: P99 latency > 2 seconds (last 5 minutes)
- Severity: WARNING
- Action:
# Check for slow database queries
docker compose logs charliehub-postgres | grep -i "slow\|duration" | tail -20
# Profile specific endpoints in Grafana
# Query: histogram_quantile(0.99, rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))
External Health Monitoring (px5)¶
Overview¶
The px5 Proxmox node in France monitors hub2's health every 5 minutes and sends alerts if hub2 becomes unreachable.
Script: /usr/local/bin/hub2-health-check (on px5)
Health Checks Performed¶
- DNS Resolution: resolves charliehub.net → checks for DNS failures
- Network Connectivity: pings hub2 → detects network outages
- HTTPS Port: checks port 443 is open → detects firewall issues
- Prometheus Health: calls /api/v1/status/flags → verifies metrics collection
- Grafana Health: calls /api/health → verifies dashboards are accessible
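The script itself is not reproduced here; a minimal sketch of the same five checks, assuming standard tools (dig, ping, nc, curl) are available on px5, looks roughly like this:
#!/usr/bin/env bash
# Minimal sketch of the px5 checks - not the actual hub2-health-check script
HOST=charliehub.net
[ -n "$(dig +short "$HOST")" ]               && dns=up          || dns=down
ping -c1 -W2 "$HOST" >/dev/null 2>&1         && connectivity=up || connectivity=down
nc -z -w5 "$HOST" 443                        && https=up        || https=down
curl -fsS https://prometheus.charliehub.net/api/v1/status/flags >/dev/null && prometheus=up || prometheus=down
curl -fsS https://grafana.charliehub.net/api/health >/dev/null             && grafana=up    || grafana=down
echo "dns=$dns connectivity=$connectivity https=$https prometheus=$prometheus grafana=$grafana"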
Results Logged¶
File: /var/log/hub2-healthcheck.json (on px5)
{
  "timestamp": "2026-02-06T17:37:14Z",
  "unix_timestamp": 1770399434,
  "status": "healthy",
  "connectivity_score": 3,
  "checks": {
    "dns": "up",
    "connectivity": "up",
    "prometheus": "down",
    "grafana": "down"
  }
}
Alert Triggers¶
- Unreachable: Hub2 connectivity fails → email sent to ALERT_EMAIL
- Recovery: Hub2 comes back online → recovery email sent
- Frequency: Cron job runs every 5 minutes (see the illustrative schedule below)
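For reference, a 5-minute schedule in /etc/cron.d format looks like the line below (illustrative only; the actual hub2-health-monitoring file on px5 may differ):
*/5 * * * * root /usr/local/bin/hub2-health-check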
Checking px5 Health Status¶
# SSH to px5
ssh px5
# View latest health check
cat /var/log/hub2-healthcheck.json | jq .
# Watch health checks in real-time
tail -f /var/log/hub2-healthcheck.log
# Check cron job status
cat /etc/cron.d/hub2-health-monitoring
# Run health check manually
/usr/local/bin/hub2-health-check
Alert Routing and Delivery¶
Alert Flow¶
Prometheus Alert Fires
↓
Alertmanager receives
↓
┌─────────────────────────────┐
│ Route by severity/label │
└─────────────────────────────┘
↓
┌─────────────────────────┐
│ Critical Alerts │
│ ├─ group_wait: 10s │
│ └─ Send immediately │
└─────────────────────────┘
↓
┌──────────────────────────────┐
│ Warning Alerts │
│ ├─ group_wait: 30s │
│ └─ Batch and send │
└──────────────────────────────┘
↓
├─ Primary: cpaumelle@eroundit.eu (immediate)
│
└─ Secondary: chpa35@gmail.com (after 5 min if no ack)
Alert Configuration (alertmanager.yml)¶
route:
  receiver: 'email-alerts'
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts - immediate notification
    - match:
        severity: critical
      receiver: 'email-alerts'
      group_wait: 10s        # Send after 10s
      repeat_interval: 1h    # Re-send every hour
    # Warning alerts - batched
    - match:
        severity: warning
      receiver: 'email-alerts'
      repeat_interval: 4h    # Re-send every 4 hours
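To confirm which routing configuration Alertmanager actually loaded (as opposed to what is on disk), the v1 status endpoint exposes the active config; this uses the same API as the other examples in this guide:
# Show the route block from the configuration Alertmanager is currently running with
curl -s http://localhost:9093/api/v1/status | jq -r '.data.configYAML' | grep -A 20 '^route:'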
Email Format¶
Subject: [FIRING] AlertName - charliehub alert
From: cpaumelle@eroundit.eu
To: cpaumelle@eroundit.eu
Timestamp: <alert firing time>
Alert: PrometheusDown
Severity: CRITICAL
Instance: hub2
Status: FIRING
Description: Prometheus metrics collection is offline
Managing Alerts¶
Viewing Active Alerts¶
# Via Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
# Via Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[]'
# Via the public Prometheus API endpoint (through Traefik)
curl -s https://prometheus.charliehub.net/api/v1/alerts | jq .
Silencing Alerts Temporarily¶
# Create a silence (1 hour) - the silences API takes absolute startsAt/endsAt timestamps, not a duration
curl -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "PrometheusDown",
        "isRegex": false
      }
    ],
    "startsAt": "2026-02-06T18:00:00Z",
    "endsAt": "2026-02-06T19:00:00Z",
    "createdBy": "ops",
    "comment": "Maintenance window"
  }'
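Alternatively, amtool (bundled in the Alertmanager image) accepts a relative duration, which avoids writing timestamps by hand:
# Create a 1-hour silence with amtool
docker exec charliehub_alertmanager amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --duration=1h --author=ops --comment="Maintenance window" \
  alertname=PrometheusDown
# List active silences, then expire one by ID when the maintenance is over
docker exec charliehub_alertmanager amtool silence query --alertmanager.url=http://localhost:9093
docker exec charliehub_alertmanager amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>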
Adding New Alert Rules¶
1. Create the rule file:

   vi /opt/charliehub/monitoring/prometheus/rules/my-alerts.yml

2. Define the rule group:

   groups:
     - name: my_alerts
       interval: 30s
       rules:
         - alert: MyAlert
           expr: some_metric > 100
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "My alert description"

3. Reload Prometheus:

   docker compose restart prometheus
   # or
   curl -X POST http://localhost:9090/-/reload
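Before the reload step, it is worth validating the rule file with promtool, which ships in the Prometheus image. The container name and in-container path below are assumptions; adjust them to the actual compose service:
# Validate rule syntax before reloading
# (container name and mount path are assumptions - adjust to your setup)
docker exec charliehub_prometheus promtool check rules /etc/prometheus/rules/my-alerts.yml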
Modifying Alert Recipients¶
Step 1: Update .env
vi /opt/charliehub/.env
Step 2: Update alertmanager.yml
vi /opt/charliehub/monitoring/alertmanager/alertmanager.yml
⚠️ Keep both files synchronized!
Step 3: Restart Alertmanager
docker compose restart alertmanager
Step 4: Test
# Verify config loaded
docker compose logs alertmanager | tail -10
# Check email config
docker exec charliehub_alertmanager cat /etc/alertmanager/alertmanager.yml | grep -A 5 "email_configs:"
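For an end-to-end check of email delivery, you can push a synthetic alert into Alertmanager via the same v1 API used above; it will be routed like any other warning and resolves on its own once it stops being re-posted:
# Push a synthetic test alert to verify SMTP delivery end to end
curl -s -X POST http://localhost:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "EmailDeliveryTest", "severity": "warning", "instance": "hub2"},
    "annotations": {"summary": "Test alert to verify email delivery"}
  }]'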
Troubleshooting¶
Alertmanager Won't Start¶
Symptom: Container restart loop, YAML errors in logs
docker compose logs alertmanager | grep -i error
Common Causes:
- Invalid YAML syntax in alertmanager.yml
- Special characters in password not escaped
- Invalid field names in configuration
Fix:
# Validate the config with amtool (ships in the Alertmanager image; works even if the container is crash-looping)
# Image name below is an assumption - use the image/tag from docker-compose.yml
docker run --rm --entrypoint amtool \
  -v /opt/charliehub/monitoring/alertmanager:/config \
  prom/alertmanager check-config /config/alertmanager.yml
# Or manually inspect the file
head -20 /opt/charliehub/monitoring/alertmanager/alertmanager.yml
Emails Not Sending¶
Symptom: Alerts fire but no emails received
Diagnosis:
# 1. Check Alertmanager is running
docker compose ps alertmanager
# 2. Check config loaded successfully
docker compose logs alertmanager | grep "Completed loading"
# 3. Check active alerts in Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | {alertname: .labels.alertname, status: .status.state}'
# 4. Check Alertmanager logs for SMTP errors
docker compose logs alertmanager | grep -i "smtp\|email\|error" | tail -20
# 5. Test SMTP connectivity
docker exec charliehub_alertmanager \
nc -zv smtp.gmail.com 587
Common Fixes:
- Gmail app password incorrect or expired → Generate new one
- SMTP_USER doesn't have email sending permissions → Use valid Gmail account
- Email address typo in alertmanager.yml → Verify both recipients
- Firewall blocking port 587 → Check network policy
Prometheus Alerts Not Firing¶
Symptom: Alert rule defined but never fires
# 1. Check rule is loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | {name, rule_count: (.rules | length)}'
# 2. Check rule is evaluating
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="monitoring_health") | .rules[0]'
# 3. Check metric exists
curl -s http://localhost:9090/api/v1/query?query=up | jq '.data.result[]' | head -5
# 4. Test the PromQL expression
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="prometheus"}' | jq .
Common Causes:
- Metric doesn't exist → Check Prometheus targets are UP
- Expression syntax error → Validate in Prometheus UI
- Threshold too high → Alert never reaches condition
- Target down → No data to evaluate
px5 Health Monitoring Not Running¶
Symptom: /var/log/hub2-healthcheck.json not updating
# SSH to px5 and check:
# 1. Health check script exists
ls -la /usr/local/bin/hub2-health-check
# 2. Cron job configured
cat /etc/cron.d/hub2-health-monitoring
# 3. Manually run health check
/usr/local/bin/hub2-health-check
# 4. Check cron logs
tail -f /var/log/syslog | grep hub2-health
Fix: Re-run setup script
# From hub2:
ssh px5 'bash -s' < /opt/charliehub/monitoring/scripts/setup-px5-health-monitoring.sh
Best Practices¶
Alert Configuration¶
✅ Do:
- Keep alert thresholds realistic based on historical data
- Use meaningful alert names that describe the problem
- Include actionable annotations (what to check)
- Test new rules in staging first
- Document why each alert exists
❌ Don't:
- Create alert storms (too sensitive, fires constantly)
- Ignore repeated alerts
- Disable alerts without root cause fix
- Use hardcoded IPs/domains in annotations
- Set very long `for:` durations (slower incident response)
Alert Response Process¶
- Alert Fires → Email received
- Acknowledge → Reply to email or check Alertmanager UI
- Investigate → Review logs, check target health
- Fix → Apply corrective action
- Verify → Confirm alert resolves in Prometheus
- Review → Document in incident log
References¶
- Alertmanager Official Docs
- Alert Rules Documentation
- PromQL Best Practices
- Monitoring Operations - Day-to-day monitoring tasks
Last Updated: 2026-02-06
Status: Phase 1 Complete - Production Ready