Alert Configuration and Management¶
Comprehensive guide to configuring, managing, and troubleshooting alerts in the CharlieHub monitoring system.
Quick Start¶
Alert Recipients¶
| Role | Email | Purpose |
|---|---|---|
| Primary | cpaumelle@eroundit.eu | Main alert destination |
| Secondary | chpa35@gmail.com | Fallback after 5 minutes if primary unresponsive |
Configured in: /opt/charliehub/.env
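To confirm what is currently configured, you can read the recipient variables straight from the environment file:
# List the alert recipient variables set in /opt/charliehub/.env
grep '^ALERT_EMAIL' /opt/charliehub/.env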
Alert Status¶
# Check if Alertmanager is healthy
docker compose ps alertmanager
# View active alerts
docker exec charliehub_alertmanager wget -qO- http://localhost:9093/api/v1/alerts | jq '.data[]'
# Check Prometheus alerts firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
Alert Configuration¶
Email Recipients Configuration¶
File: /opt/charliehub/.env
# Primary alert recipient
ALERT_EMAIL=cpaumelle@eroundit.eu
# Secondary recipient (fallback after 5 min)
ALERT_EMAIL_SECONDARY=chpa35@gmail.com
# Critical-only recipient (for escalation)
ALERT_EMAIL_CRITICAL_ONLY=cpaumelle@eroundit.eu
SMTP Configuration¶
File: /opt/charliehub/.env
# Gmail SMTP settings
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=alerts@charliehub.net
SMTP_PASSWORD=<gmail-app-password> # 16-char password
SMTP_FROM=cpaumelle@eroundit.eu # Sender address
SMTP_REQUIRE_TLS=true # Enforce TLS encryption
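To check that these SMTP settings are usable from hub2 (port reachable, STARTTLS offered), a quick manual probe with openssl works; this is a sketch and assumes openssl is installed on the host:
# Probe Gmail's SMTP relay: expect a STARTTLS handshake and a valid certificate chain
openssl s_client -starttls smtp -connect smtp.gmail.com:587 -brief </dev/null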
File: /opt/charliehub/monitoring/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'cpaumelle@eroundit.eu'
  smtp_auth_username: 'alerts@charliehub.net'
  smtp_auth_password: '<gmail-app-password>'
  smtp_require_tls: true
⚠️ Technical Debt: Email addresses are currently hardcoded in alertmanager.yml. This requires manual synchronization with .env values. See TECHNICAL-DEBT.md for refactoring plan.
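Until that refactor lands, one way to keep the two files in sync is to render alertmanager.yml from a template with envsubst. This is only a sketch: the .template file does not exist yet, and it assumes the .env values are shell-safe (quoted, no inline comments).
# Hypothetical sketch: render alertmanager.yml from a template using the values in .env
# (alertmanager.yml.template is not part of the current setup - it would contain ${ALERT_EMAIL}, ${SMTP_HOST}, etc.)
set -a; . /opt/charliehub/.env; set +a
envsubst < /opt/charliehub/monitoring/alertmanager/alertmanager.yml.template \
  > /opt/charliehub/monitoring/alertmanager/alertmanager.yml
docker compose restart alertmanager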
Alert Rules¶
Overview¶
Prometheus evaluates alert rules continuously and sends firing alerts to Alertmanager. Alertmanager routes them to configured recipients.
Alert Rules Location: /opt/charliehub/monitoring/prometheus/rules/
| Rule Group | File | Description | Rule Count |
|---|---|---|---|
| `monitoring_health` | `monitoring-health-alerts.yml` | Monitoring stack health checks | 19 |
| `traefik_health` | `traefik-alerts.yml` | Traefik reverse proxy metrics | 3 |
| `security` | `security-alerts.yml` | Meta-monitoring and security | Various |
Monitoring Stack Health Alerts¶
File: /opt/charliehub/monitoring/prometheus/rules/monitoring-health-alerts.yml
These 19+ alert rules monitor the health of Prometheus, Alertmanager, Grafana, and Traefik themselves.
Prometheus Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `PrometheusDown` | Prometheus unreachable | 3 min | CRITICAL | Immediate investigation required |
| `PrometheusHighMemoryUsage` | Prometheus > 1 GB RAM | 5 min | WARNING | Check for memory leak; restart if needed |
| `PrometheusHighDiskUsage` | Prometheus disk > 90% | 10 min | WARNING | Clean up old metrics; increase retention size |
| `PrometheusNotScraping` | No new metrics in 5 min | 5 min | WARNING | Check target health; verify scrape config |
| `PrometheusTargetDown` | Scrape target unreachable | 2 min | WARNING | Check target connectivity/firewall |
| `PrometheusConfigReloadFailed` | Config reload fails | 1 min | WARNING | Validate YAML syntax; check file permissions |
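To see which scrape targets would currently satisfy a down-target alert such as PrometheusTargetDown, you can evaluate the underlying expression by hand (the rule's exact PromQL may differ from this up == 0 sketch):
# List scrape targets that are currently down
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up == 0' | jq '.data.result[].metric'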
Alertmanager Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `AlertmanagerDown` | Alertmanager unreachable | 2 min | CRITICAL | Alerts will not be routed - immediate fix required |
| `AlertmanagerHighMemoryUsage` | Alertmanager > 512 MB | 5 min | WARNING | Check for memory leak; restart if needed |
| `AlertmanagerNotificationsFailed` | Email delivery failures | 5 min | CRITICAL | Check SMTP credentials; verify recipient email |
Grafana Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `GrafanaDown` | Grafana unreachable | 2 min | WARNING | Restart container; check logs |
| `GrafanaHighMemoryUsage` | Grafana > 1 GB RAM | 5 min | WARNING | Restart or optimize dashboards |
Traefik Alerts¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `TraefikMetricsDown` | Metrics endpoint unreachable | 2 min | WARNING | Restart Traefik; check Docker network |
| `TraefikHighErrorRate` | 5xx error rate > 2% | 5 min | CRITICAL | Investigate backend services; check logs |
| `TraefikSlowResponseTime` | P99 latency > 2 sec | 5 min | WARNING | Profile slow endpoints; check database queries |
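To check the current 5xx ratio behind TraefikHighErrorRate, the standard Traefik entrypoint metrics can be queried directly (metric names assume Traefik's built-in Prometheus instrumentation; the rule's exact expression may differ):
# Approximate the 5xx error ratio over the last 5 minutes
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_entrypoint_requests_total[5m]))' \
  | jq '.data.result[0].value[1]'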
Overall Stack Health¶
| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| `MonitoringStackDown` | 2+ monitoring components down | 5 min | CRITICAL | Major outage - full investigation required |
| `MonitoringMetricCardinalityHigh` | Time series > 500k | 10 min | WARNING | Reduce labels in scrape configs; consider downsampling |
Traefik Metrics Alerts¶
File: /opt/charliehub/monitoring/prometheus/rules/traefik-alerts.yml
Alerts for the Traefik reverse proxy, monitoring request rates, error rates, and latency.
Traefik Alert Rules¶
TraefikMetricsDown¶
- Condition: Traefik metrics endpoint unreachable
- Duration: 2 minutes
- Severity: WARNING
- Action:
docker compose restart traefik
docker compose logs traefik | tail -50
TraefikHighErrorRate¶
- Condition: 5xx error rate > 2% (last 5 minutes)
- Severity: CRITICAL
- Action:
# Check Traefik logs for backend errors
docker compose logs traefik | grep -i "error\|5[0-9][0-9]" | tail -20
# Check backend services health
curl -s https://grafana.charliehub.net/api/health
curl -s https://prometheus.charliehub.net/-/healthy
TraefikSlowResponseTime¶
- Condition: P99 latency > 2 seconds (last 5 minutes)
- Severity: WARNING
- Action:
# Check for slow database queries
docker compose logs charliehub-postgres | grep -i "slow\|duration" | tail -20
# Profile specific endpoints in Grafana
# Query: histogram_quantile(0.99, rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))
External Health Monitoring (px5)¶
Overview¶
The px5 Proxmox node in France monitors hub2's health every 5 minutes and sends alerts if hub2 becomes unreachable.
Script: /usr/local/bin/hub2-health-check (on px5)
Health Checks Performed¶
- DNS Resolution: resolves charliehub.net → checks for DNS failures
- Network Connectivity: pings hub2 → detects network outages
- HTTPS Port: checks port 443 is open → detects firewall issues
- Prometheus Health: calls /api/v1/status/flags → verifies metrics collection
- Grafana Health: calls /api/health → verifies dashboards are accessible
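The script itself is not reproduced here; a minimal sketch of the same five checks, assuming standard tools (dig, ping, nc, curl) are available on px5, looks roughly like this:
#!/usr/bin/env bash
# Minimal sketch of the px5 checks - not the actual hub2-health-check script
HOST=charliehub.net
[ -n "$(dig +short "$HOST")" ]               && dns=up          || dns=down
ping -c1 -W2 "$HOST" >/dev/null 2>&1         && connectivity=up || connectivity=down
nc -z -w5 "$HOST" 443                        && https=up        || https=down
curl -fsS https://prometheus.charliehub.net/api/v1/status/flags >/dev/null && prometheus=up || prometheus=down
curl -fsS https://grafana.charliehub.net/api/health >/dev/null             && grafana=up    || grafana=down
echo "dns=$dns connectivity=$connectivity https=$https prometheus=$prometheus grafana=$grafana"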
Results Logged¶
File: /var/log/hub2-healthcheck.json (on px5)
{
  "timestamp": "2026-02-06T17:37:14Z",
  "unix_timestamp": 1770399434,
  "status": "healthy",
  "connectivity_score": 3,
  "checks": {
    "dns": "up",
    "connectivity": "up",
    "prometheus": "down",
    "grafana": "down"
  }
}
Alert Triggers¶
- Unreachable: Hub2 connectivity fails → email sent to ALERT_EMAIL
- Recovery: Hub2 comes back online → recovery email sent
- Frequency: Cron job runs every 5 minutes (see the illustrative schedule below)
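For reference, a 5-minute schedule in /etc/cron.d format looks like the line below (illustrative only; the actual hub2-health-monitoring file on px5 may differ):
*/5 * * * * root /usr/local/bin/hub2-health-check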
Checking px5 Health Status¶
# SSH to px5
ssh px5
# View latest health check
cat /var/log/hub2-healthcheck.json | jq .
# Watch health checks in real-time
tail -f /var/log/hub2-healthcheck.log
# Check cron job status
cat /etc/cron.d/hub2-health-monitoring
# Run health check manually
/usr/local/bin/hub2-health-check
Alert Routing and Delivery¶
Alert Flow¶
Prometheus Alert Fires
↓
Alertmanager receives
↓
┌─────────────────────────────┐
│ Route by severity/label │
└─────────────────────────────┘
↓
┌─────────────────────────┐
│ Critical Alerts │
│ ├─ group_wait: 10s │
│ └─ Send immediately │
└─────────────────────────┘
↓
┌──────────────────────────────┐
│ Warning Alerts │
│ ├─ group_wait: 30s │
│ └─ Batch and send │
└──────────────────────────────┘
↓
├─ Primary: cpaumelle@eroundit.eu (immediate)
│
└─ Secondary: chpa35@gmail.com (after 5 min if no ack)
Alert Configuration (alertmanager.yml)¶
route:
  receiver: 'email-alerts'
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts - immediate notification
    - match:
        severity: critical
      receiver: 'email-alerts'
      group_wait: 10s        # Send after 10s
      repeat_interval: 1h    # Re-send every hour
    # Warning alerts - batched
    - match:
        severity: warning
      receiver: 'email-alerts'
      repeat_interval: 4h    # Re-send every 4 hours
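To confirm which routing configuration Alertmanager actually loaded (as opposed to what is on disk), the v1 status endpoint exposes the active config; this uses the same API as the other examples in this guide:
# Show the route block from the configuration Alertmanager is currently running with
curl -s http://localhost:9093/api/v1/status | jq -r '.data.configYAML' | grep -A 20 '^route:'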
Email Format¶
Subject: [FIRING] AlertName - charliehub alert
From: cpaumelle@eroundit.eu
To: cpaumelle@eroundit.eu
Timestamp: <alert firing time>
Alert: PrometheusDown
Severity: CRITICAL
Instance: hub2
Status: FIRING
Description: Prometheus metrics collection is offline
Managing Alerts¶
Viewing Active Alerts¶
# Via Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
# Via Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[]'
# Via the public Prometheus API endpoint (through Traefik)
curl -s https://prometheus.charliehub.net/api/v1/alerts | jq .
Silencing Alerts Temporarily¶
# Create a silence (1 hour) - the silences API takes absolute startsAt/endsAt timestamps, not a duration
curl -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [
      {
        "name": "alertname",
        "value": "PrometheusDown",
        "isRegex": false
      }
    ],
    "startsAt": "2026-02-06T18:00:00Z",
    "endsAt": "2026-02-06T19:00:00Z",
    "createdBy": "ops",
    "comment": "Maintenance window"
  }'
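Alternatively, amtool (bundled in the Alertmanager image) accepts a relative duration, which avoids writing timestamps by hand:
# Create a 1-hour silence with amtool
docker exec charliehub_alertmanager amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --duration=1h --author=ops --comment="Maintenance window" \
  alertname=PrometheusDown
# List active silences, then expire one by ID when the maintenance is over
docker exec charliehub_alertmanager amtool silence query --alertmanager.url=http://localhost:9093
docker exec charliehub_alertmanager amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>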
Adding New Alert Rules¶
1. Create the rule file:

   vi /opt/charliehub/monitoring/prometheus/rules/my-alerts.yml

2. Define the rule group:

   groups:
     - name: my_alerts
       interval: 30s
       rules:
         - alert: MyAlert
           expr: some_metric > 100
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "My alert description"

3. Reload Prometheus:

   docker compose restart prometheus
   # or
   curl -X POST http://localhost:9090/-/reload
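Before the reload step, it is worth validating the rule file with promtool, which ships in the Prometheus image. The container name and in-container path below are assumptions; adjust them to the actual compose service:
# Validate rule syntax before reloading
# (container name and mount path are assumptions - adjust to your setup)
docker exec charliehub_prometheus promtool check rules /etc/prometheus/rules/my-alerts.yml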
Modifying Alert Recipients¶
Step 1: Update .env
vi /opt/charliehub/.env
Step 2: Update alertmanager.yml
vi /opt/charliehub/monitoring/alertmanager/alertmanager.yml
⚠️ Keep both files synchronized!
Step 3: Restart Alertmanager
docker compose restart alertmanager
Step 4: Test
# Verify config loaded
docker compose logs alertmanager | tail -10
# Check email config
docker exec charliehub_alertmanager cat /etc/alertmanager/alertmanager.yml | grep -A 5 "email_configs:"
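For an end-to-end check of email delivery, you can push a synthetic alert into Alertmanager via the same v1 API used above; it will be routed like any other warning and resolves on its own once it stops being re-posted:
# Push a synthetic test alert to verify SMTP delivery end to end
curl -s -X POST http://localhost:9093/api/v1/alerts \
  -H "Content-Type: application/json" \
  -d '[{
    "labels": {"alertname": "EmailDeliveryTest", "severity": "warning", "instance": "hub2"},
    "annotations": {"summary": "Test alert to verify email delivery"}
  }]'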
Troubleshooting¶
Alertmanager Won't Start¶
Symptom: Container restart loop, YAML errors in logs
docker compose logs alertmanager | grep -i error
Common Causes:
- Invalid YAML syntax in alertmanager.yml
- Special characters in password not escaped
- Invalid field names in configuration
Fix:
# Validate the config with amtool (ships in the Alertmanager image; works even if the container is crash-looping)
# Image name below is an assumption - use the image/tag from docker-compose.yml
docker run --rm --entrypoint amtool \
  -v /opt/charliehub/monitoring/alertmanager:/config \
  prom/alertmanager check-config /config/alertmanager.yml
# Or manually inspect the file
head -20 /opt/charliehub/monitoring/alertmanager/alertmanager.yml
Emails Not Sending¶
Symptom: Alerts fire but no emails received
Diagnosis:
# 1. Check Alertmanager is running
docker compose ps alertmanager
# 2. Check config loaded successfully
docker compose logs alertmanager | grep "Completed loading"
# 3. Check active alerts in Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | {alertname: .labels.alertname, status: .status.state}'
# 4. Check Alertmanager logs for SMTP errors
docker compose logs alertmanager | grep -i "smtp\|email\|error" | tail -20
# 5. Test SMTP connectivity
docker exec charliehub_alertmanager \
nc -zv smtp.gmail.com 587
Common Fixes:
- Gmail app password incorrect or expired → Generate new one
- SMTP_USER doesn't have email sending permissions → Use valid Gmail account
- Email address typo in alertmanager.yml → Verify both recipients
- Firewall blocking port 587 → Check network policy
Prometheus Alerts Not Firing¶
Symptom: Alert rule defined but never fires
# 1. Check rule is loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | {name, rule_count: (.rules | length)}'
# 2. Check rule is evaluating
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="monitoring_health") | .rules[0]'
# 3. Check metric exists
curl -s http://localhost:9090/api/v1/query?query=up | jq '.data.result[]' | head -5
# 4. Test the PromQL expression
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="prometheus"}' | jq .
Common Causes:
- Metric doesn't exist → Check Prometheus targets are UP
- Expression syntax error → Validate in Prometheus UI
- Threshold too high → Alert never reaches condition
- Target down → No data to evaluate
px5 Health Monitoring Not Running¶
Symptom: /var/log/hub2-healthcheck.json not updating
# SSH to px5 and check:
# 1. Health check script exists
ls -la /usr/local/bin/hub2-health-check
# 2. Cron job configured
cat /etc/cron.d/hub2-health-monitoring
# 3. Manually run health check
/usr/local/bin/hub2-health-check
# 4. Check cron logs
tail -f /var/log/syslog | grep hub2-health
Fix: Re-run setup script
# From hub2:
ssh px5 'bash -s' < /opt/charliehub/monitoring/scripts/setup-px5-health-monitoring.sh
Best Practices¶
Alert Configuration¶
✅ Do:
- Keep alert thresholds realistic based on historical data
- Use meaningful alert names that describe the problem
- Include actionable annotations (what to check)
- Test new rules in staging first
- Document why each alert exists
❌ Don't:
- Create alert storms (too sensitive, fires constantly)
- Ignore repeated alerts
- Disable alerts without root cause fix
- Use hardcoded IPs/domains in annotations
- Set very long `for:` durations (slower incident response)
Alert Response Process¶
- Alert Fires → Email received
- Acknowledge → Reply to email or check Alertmanager UI
- Investigate → Review logs, check target health
- Fix → Apply corrective action
- Verify → Confirm alert resolves in Prometheus
- Review → Document in incident log
References¶
- Alertmanager Official Docs
- Alert Rules Documentation
- PromQL Best Practices
- Monitoring Operations - Day-to-day monitoring tasks
Last Updated: 2026-02-06
Status: Phase 1 Complete - Production Ready