Alert Configuration and Management

Comprehensive guide to configuring, managing, and troubleshooting alerts in the CharlieHub monitoring system.

Quick Start

Alert Recipients

| Role | Email | Purpose |
|---|---|---|
| Primary | cpaumelle@eroundit.eu | Main alert destination |
| Secondary | chpa35@gmail.com | Fallback after 5 minutes if primary is unresponsive |

Configured in: /opt/charliehub/.env

Alert Status

# Check if Alertmanager is healthy
docker compose ps alertmanager

# View active alerts
docker exec charliehub_alertmanager wget -qO- http://localhost:9093/api/v1/alerts | jq '.data[]'

# Check Prometheus alerts firing
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

Alert Configuration

Email Recipients Configuration

File: /opt/charliehub/.env

# Primary alert recipient
ALERT_EMAIL=cpaumelle@eroundit.eu

# Secondary recipient (fallback after 5 min)
ALERT_EMAIL_SECONDARY=chpa35@gmail.com

# Critical-only recipient (for escalation)
ALERT_EMAIL_CRITICAL_ONLY=cpaumelle@eroundit.eu

SMTP Configuration

File: /opt/charliehub/.env

# Gmail SMTP settings
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=alerts@charliehub.net
SMTP_PASSWORD=<gmail-app-password>  # 16-char password
SMTP_FROM=cpaumelle@eroundit.eu     # Sender address
SMTP_REQUIRE_TLS=true               # Enforce TLS encryption
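
To verify these credentials independently of Alertmanager, a one-off test message can be sent with curl's SMTP support. This is a minimal sketch; substitute the real app password from .env before running:

# Build a minimal test message
cat > /tmp/smtp-test.txt <<'EOF'
From: cpaumelle@eroundit.eu
To: cpaumelle@eroundit.eu
Subject: CharlieHub SMTP test

Test message to confirm the Gmail app password works.
EOF

# Send it through Gmail with STARTTLS (--ssl-reqd forces TLS on port 587)
curl --ssl-reqd --url 'smtp://smtp.gmail.com:587' \
  --user 'alerts@charliehub.net:<gmail-app-password>' \
  --mail-from 'cpaumelle@eroundit.eu' \
  --mail-rcpt 'cpaumelle@eroundit.eu' \
  --upload-file /tmp/smtp-test.txt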

File: /opt/charliehub/monitoring/alertmanager/alertmanager.yml

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'cpaumelle@eroundit.eu'
  smtp_auth_username: 'alerts@charliehub.net'
  smtp_auth_password: '<gmail-app-password>'   # keep in sync with SMTP_PASSWORD in .env
  smtp_require_tls: true

⚠️ Technical Debt: Email addresses are currently hardcoded in alertmanager.yml. This requires manual synchronization with .env values. See TECHNICAL-DEBT.md for refactoring plan.
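
Until that refactor lands, one way to keep the two files aligned is to render alertmanager.yml from a template whenever .env changes. A minimal sketch, assuming a hypothetical alertmanager.yml.tmpl containing ${ALERT_EMAIL}-style placeholders (this template does not exist yet):

# Hypothetical sync step - alertmanager.yml.tmpl is an assumption, not an existing file
set -a; . /opt/charliehub/.env; set +a   # export .env values so envsubst can see them
envsubst < /opt/charliehub/monitoring/alertmanager/alertmanager.yml.tmpl \
  > /opt/charliehub/monitoring/alertmanager/alertmanager.yml
docker compose restart alertmanager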


Alert Rules

Overview

Prometheus evaluates alert rules continuously and sends firing alerts to Alertmanager. Alertmanager routes them to configured recipients.

Alert Rules Location: /opt/charliehub/monitoring/prometheus/rules/

| Rule Group | File | Description | Rule Count |
|---|---|---|---|
| monitoring_health | monitoring-health-alerts.yml | Monitoring stack health checks | 19 |
| traefik_health | traefik-alerts.yml | Traefik reverse proxy metrics | 3 |
| security | security-alerts.yml | Meta-monitoring and security | Various |
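
To confirm which rule groups Prometheus has actually loaded, query the rules API (the endpoint is standard; the jq filter below is a minimal sketch):

# Confirm the rule groups above are loaded
curl -s http://localhost:9090/api/v1/rules | jq -r '.data.groups[].name'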

Monitoring Stack Health Alerts

File: /opt/charliehub/monitoring/prometheus/rules/monitoring-health-alerts.yml

These 19+ alert rules monitor the health of Prometheus, Alertmanager, Grafana, and Traefik themselves.

Prometheus Alerts

| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| PrometheusDown | Prometheus unreachable | 3 min | CRITICAL | Immediate investigation required |
| PrometheusHighMemoryUsage | Prometheus > 1 GB RAM | 5 min | WARNING | Check for memory leak; restart if needed |
| PrometheusHighDiskUsage | Prometheus disk > 90% | 10 min | WARNING | Clean up old metrics; increase retention size |
| PrometheusNotScraping | No new metrics in 5 min | 5 min | WARNING | Check target health; verify scrape config |
| PrometheusTargetDown | Scrape target unreachable | 2 min | WARNING | Check target connectivity/firewall |
| PrometheusConfigReloadFailed | Config reload fails | 1 min | WARNING | Validate YAML syntax; check file permissions |

Alertmanager Alerts

| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| AlertmanagerDown | Alertmanager unreachable | 2 min | CRITICAL | Alerts will not be routed - immediate fix required |
| AlertmanagerHighMemoryUsage | Alertmanager > 512 MB | 5 min | WARNING | Check for memory leak; restart if needed |
| AlertmanagerNotificationsFailed | Email delivery failures | 5 min | CRITICAL | Check SMTP credentials; verify recipient email |

Grafana Alerts

| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| GrafanaDown | Grafana unreachable | 2 min | WARNING | Restart container; check logs |
| GrafanaHighMemoryUsage | Grafana > 1 GB RAM | 5 min | WARNING | Restart or optimize dashboards |

Traefik Alerts

| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| TraefikMetricsDown | Metrics endpoint unreachable | 2 min | WARNING | Restart Traefik; check Docker network |
| TraefikHighErrorRate | 5xx error rate > 2% | 5 min | CRITICAL | Investigate backend services; check logs |
| TraefikSlowResponseTime | P99 latency > 2 sec | 5 min | WARNING | Profile slow endpoints; check database queries |

Overall Stack Health

| Alert | Condition | Duration | Severity | Action |
|---|---|---|---|---|
| MonitoringStackDown | 2+ monitoring components down | 5 min | CRITICAL | Major outage - full investigation required |
| MonitoringMetricCardinalityHigh | Time series > 500k | 10 min | WARNING | Reduce labels in scrape configs; consider downsampling |
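
As an illustration of the kind of expressions behind these two rules, the sketches below use standard Prometheus self-metrics; the job label values are assumptions and the deployed rule file may differ:

# MonitoringStackDown (sketch): two or more monitoring jobs report down
count(up{job=~"prometheus|alertmanager|grafana|traefik"} == 0) >= 2

# MonitoringMetricCardinalityHigh (sketch): active head series above 500k
prometheus_tsdb_head_series > 500000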

Traefik Metrics Alerts

File: /opt/charliehub/monitoring/prometheus/rules/traefik-alerts.yml

Alerts for the Traefik reverse proxy, monitoring request rates, error rates, and latency.
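
Before the per-rule runbooks below, here is a hedged sketch of how the 5xx error-rate rule might be expressed in traefik-alerts.yml; the metric names are standard Traefik Prometheus metrics, but the deployed rule may differ:

- alert: TraefikHighErrorRate
  expr: |
    sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
      / sum(rate(traefik_entrypoint_requests_total[5m])) > 0.02
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Traefik 5xx error rate above 2% for 5 minutes"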

Traefik Alert Rules

TraefikMetricsDown

  • Condition: Traefik metrics endpoint unreachable
  • Duration: 2 minutes
  • Severity: WARNING
  • Action:
    docker compose restart traefik
    docker compose logs traefik | tail -50
    

TraefikHighErrorRate

  • Condition: 5xx error rate > 2% (last 5 minutes)
  • Severity: CRITICAL
  • Action:
    # Check Traefik logs for backend errors
    docker compose logs traefik | grep -i "error\|5[0-9][0-9]" | tail -20
    
    # Check backend services health
    curl -s https://grafana.charliehub.net/api/health
    curl -s https://prometheus.charliehub.net/-/healthy
    

TraefikSlowResponseTime

  • Condition: P99 latency > 2 seconds (last 5 minutes)
  • Severity: WARNING
  • Action:
    # Check for slow database queries
    docker compose logs charliehub-postgres | grep -i "slow\|duration" | tail -20
    
    # Profile specific endpoints in Grafana
    # Query: histogram_quantile(0.99, rate(traefik_entrypoint_request_duration_seconds_bucket[5m]))
    

External Health Monitoring (px5)

Overview

The px5 Proxmox node in France monitors hub2's health every 5 minutes and sends alerts if hub2 becomes unreachable.

Script: /usr/local/bin/hub2-health-check (on px5)

Health Checks Performed

DNS Resolution:
  - Resolves charliehub.net → checks for DNS failures

Network Connectivity:
  - Pings hub2 → detects network outages

HTTPS Port:
  - Checks port 443 open → detects firewall issues

Prometheus Health:
  - Calls /api/v1/status/flags → verifies metrics collection

Grafana Health:
  - Calls /api/health → verifies dashboards accessible
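
The script itself is not reproduced in this guide; the sketch below shows one way the checks listed above could be implemented and is not the deployed /usr/local/bin/hub2-health-check:

#!/bin/bash
# Sketch only - the real script may compute status and score differently
HOST=charliehub.net
LOG=/var/log/hub2-healthcheck.json

dns=down; connectivity=down; prometheus=down; grafana=down
getent hosts "$HOST" >/dev/null 2>&1 && dns=up                   # DNS resolution
ping -c 1 -W 2 "$HOST" >/dev/null 2>&1 && connectivity=up        # network reachability
nc -z -w 5 "$HOST" 443 >/dev/null 2>&1 || connectivity=down      # HTTPS port 443 (folded into connectivity here)
curl -sf --max-time 10 https://prometheus.charliehub.net/api/v1/status/flags >/dev/null && prometheus=up
curl -sf --max-time 10 https://grafana.charliehub.net/api/health >/dev/null && grafana=up

score=0
for c in "$dns" "$connectivity" "$prometheus" "$grafana"; do
  [ "$c" = up ] && score=$((score + 1))
done
status=unreachable; [ "$connectivity" = up ] && status=healthy

cat > "$LOG" <<EOF
{
  "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "unix_timestamp": $(date +%s),
  "status": "$status",
  "connectivity_score": $score,
  "checks": {
    "dns": "$dns",
    "connectivity": "$connectivity",
    "prometheus": "$prometheus",
    "grafana": "$grafana"
  }
}
EOF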

Results Logged

File: /var/log/hub2-healthcheck.json (on px5)

{
  "timestamp": "2026-02-06T17:37:14Z",
  "unix_timestamp": 1770399434,
  "status": "healthy",
  "connectivity_score": 3,
  "checks": {
    "dns": "up",
    "connectivity": "up",
    "prometheus": "down",
    "grafana": "down"
  }
}

Alert Triggers

  • Unreachable: Hub2 connectivity fails → email sent to ALERT_EMAIL
  • Recovery: Hub2 comes back online → recovery email sent
  • Frequency: Cron job runs every 5 minutes
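
The schedule lives in a standard cron.d file; a sketch of what it typically contains (the deployed file may differ):

# /etc/cron.d/hub2-health-monitoring (sketch)
*/5 * * * * root /usr/local/bin/hub2-health-check >> /var/log/hub2-healthcheck.log 2>&1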

Checking px5 Health Status

# SSH to px5
ssh px5

# View latest health check
cat /var/log/hub2-healthcheck.json | jq .

# Watch health checks in real-time
tail -f /var/log/hub2-healthcheck.log

# Check cron job status
cat /etc/cron.d/hub2-health-monitoring

# Run health check manually
/usr/local/bin/hub2-health-check

Alert Routing and Delivery

Alert Flow

Prometheus Alert Fires
         ↓
    Alertmanager receives
         ↓
    ┌─────────────────────────────┐
    │  Route by severity/label    │
    └─────────────────────────────┘
         ↓  (one branch per severity)
    ┌─────────────────────────┐
    │ Critical Alerts         │
    │ ├─ group_wait: 10s      │
    │ └─ Send immediately     │
    └─────────────────────────┘
    ┌──────────────────────────────┐
    │ Warning Alerts               │
    │ ├─ group_wait: 30s           │
    │ └─ Batch and send            │
    └──────────────────────────────┘
         ↓  (both branches deliver to)
    ├─ Primary: cpaumelle@eroundit.eu (immediate)
    │
    └─ Secondary: chpa35@gmail.com (after 5 min if no ack)

Alert Configuration (alertmanager.yml)

route:
  receiver: 'email-alerts'
  group_by: ['alertname', 'instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts - immediate notification
    - match:
        severity: critical
      receiver: 'email-alerts'
      group_wait: 10s        # Send after 10s
      repeat_interval: 1h    # Re-send every hour

    # Warning alerts - batched
    - match:
        severity: warning
      receiver: 'email-alerts'
      repeat_interval: 4h    # Re-send every 4 hours
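
The route block above references an 'email-alerts' receiver. A minimal sketch of the matching receivers section, showing only the primary recipient; the real alertmanager.yml may wire the secondary/escalation path through additional routes or receivers:

receivers:
  - name: 'email-alerts'
    email_configs:
      - to: 'cpaumelle@eroundit.eu'
        send_resolved: true   # also email when the alert clears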

Email Format

Subject: [FIRING] AlertName - charliehub alert

From: cpaumelle@eroundit.eu
To: cpaumelle@eroundit.eu
Timestamp: <alert firing time>

Alert: PrometheusDown
Severity: CRITICAL
Instance: hub2
Status: FIRING

Description: Prometheus metrics collection is offline

Managing Alerts

Viewing Active Alerts

# Via Prometheus
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# Via Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[]'

# Via the public HTTPS endpoint
curl -s https://prometheus.charliehub.net/api/v1/alerts | jq .

Silencing Alerts Temporarily

# Create a silence (1 hour from now)
curl -X POST http://localhost:9093/api/v1/silences \
  -H "Content-Type: application/json" \
  -d "{
    \"matchers\": [
      {
        \"name\": \"alertname\",
        \"value\": \"PrometheusDown\",
        \"isRegex\": false
      }
    ],
    \"startsAt\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",
    \"endsAt\": \"$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)\",
    \"createdBy\": \"hub2-ops\",
    \"comment\": \"Maintenance window\"
  }"
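
Alternatively, amtool (bundled in the official Alertmanager image) can create the same silence without hand-building JSON; a sketch assuming the container name used elsewhere in this guide:

docker exec charliehub_alertmanager amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author=hub2-ops \
  --duration=1h \
  --comment="Maintenance window" \
  alertname=PrometheusDown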

Adding New Alert Rules

  1. Create rule file:

    vi /opt/charliehub/monitoring/prometheus/rules/my-alerts.yml
    

  2. Define rule group:

    groups:
      - name: my_alerts
        interval: 30s
        rules:
          - alert: MyAlert
            expr: some_metric > 100
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "My alert description"
    

  3. Reload Prometheus:

    docker compose restart prometheus
    # or
    curl -X POST http://localhost:9090/-/reload   # requires Prometheus to run with --web.enable-lifecycle
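
The rule file can also be validated with promtool before the reload takes effect; a sketch, where the container name charliehub_prometheus and the in-container rules path are assumptions based on the naming used elsewhere in this guide:

docker exec charliehub_prometheus \
  promtool check rules /etc/prometheus/rules/my-alerts.yml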
    

Modifying Alert Recipients

Step 1: Update .env

vi /opt/charliehub/.env

Step 2: Update alertmanager.yml

vi /opt/charliehub/monitoring/alertmanager/alertmanager.yml

⚠️ Keep both files synchronized!

Step 3: Restart Alertmanager

docker compose restart alertmanager

Step 4: Test

# Verify config loaded
docker compose logs alertmanager | tail -10

# Check email config
docker exec charliehub_alertmanager cat /etc/alertmanager/alertmanager.yml | grep -A 5 "email_configs:"
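
To confirm the new recipients actually receive mail, a synthetic alert can be pushed through Alertmanager with amtool; a sketch assuming the container name used elsewhere in this guide:

# Fire a short-lived test alert and watch for the email
docker exec charliehub_alertmanager amtool alert add \
  --alertmanager.url=http://localhost:9093 \
  --annotation=summary="Recipient change test" \
  alertname=TestEmailAlert severity=critical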


Troubleshooting

Alertmanager Won't Start

Symptom: Container restart loop, YAML errors in logs

docker compose logs alertmanager | grep -i error

Common Causes:
  - Invalid YAML syntax in alertmanager.yml
  - Special characters in password not escaped
  - Invalid field names in configuration

Fix:

# Validate the config with amtool (bundled in the official prom/alertmanager image)
docker run --rm -v /opt/charliehub/monitoring/alertmanager:/config \
  --entrypoint amtool prom/alertmanager check-config /config/alertmanager.yml

# Or manually inspect the file
head -20 /opt/charliehub/monitoring/alertmanager/alertmanager.yml

Emails Not Sending

Symptom: Alerts fire but no emails received

Diagnosis:

# 1. Check Alertmanager is running
docker compose ps alertmanager

# 2. Check config loaded successfully
docker compose logs alertmanager | grep "Completed loading"

# 3. Check active alerts in Alertmanager
curl -s http://localhost:9093/api/v1/alerts | jq '.data[] | {alertname: .labels.alertname, status}'

# 4. Check Alertmanager logs for SMTP errors
docker compose logs alertmanager | grep -i "smtp\|email\|error" | tail -20

# 5. Test SMTP connectivity
docker exec charliehub_alertmanager \
  nc -zv smtp.gmail.com 587

Common Fixes:
  - Gmail app password incorrect or expired → generate a new one
  - SMTP_USER lacks permission to send email → use a valid Gmail account
  - Email address typo in alertmanager.yml → verify both recipients
  - Firewall blocking port 587 → check network policy

Prometheus Alerts Not Firing

Symptom: Alert rule defined but never fires

# 1. Check rule is loaded
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | {name, rule_count: (.rules | length)}'

# 2. Check rule is evaluating
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="monitoring_health") | .rules[0]'

# 3. Check metric exists
curl -s http://localhost:9090/api/v1/query?query=up | jq '.data.result[]' | head -5

# 4. Test the PromQL expression (use --data-urlencode so braces and quotes survive)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up{job="prometheus"}' | jq .

Common Causes:
  - Metric doesn't exist → check that Prometheus targets are UP
  - Expression syntax error → validate in the Prometheus UI
  - Threshold too high → the alert never reaches its condition
  - Target down → no data to evaluate

px5 Health Monitoring Not Running

Symptom: /var/log/hub2-healthcheck.json not updating

# SSH to px5 and check:

# 1. Health check script exists
ls -la /usr/local/bin/hub2-health-check

# 2. Cron job configured
cat /etc/cron.d/hub2-health-monitoring

# 3. Manually run health check
/usr/local/bin/hub2-health-check

# 4. Check cron logs
tail -f /var/log/syslog | grep hub2-health

Fix: Re-run setup script

# From hub2:
ssh px5 'bash -s' < /opt/charliehub/monitoring/scripts/setup-px5-health-monitoring.sh


Best Practices

Alert Configuration

Do:
  - Keep alert thresholds realistic, based on historical data
  - Use meaningful alert names that describe the problem
  - Include actionable annotations (what to check)
  - Test new rules in staging first
  - Document why each alert exists

Don't:
  - Create alert storms (rules so sensitive they fire constantly)
  - Ignore repeated alerts
  - Disable alerts without fixing the root cause
  - Use hardcoded IPs/domains in annotations
  - Set very long "for" durations (they slow incident response)

Alert Response Process

  1. Alert Fires → Email received
  2. Acknowledge → Reply to email or check Alertmanager UI
  3. Investigate → Review logs, check target health
  4. Fix → Apply corrective action
  5. Verify → Confirm alert resolves in Prometheus
  6. Review → Document in incident log


Last Updated: 2026-02-06
Status: Phase 1 Complete - Production Ready