Traefik Metrics Monitoring¶
Complete guide to Traefik reverse proxy metrics monitoring and troubleshooting.
Overview¶
Traefik (the reverse proxy) exposes Prometheus metrics on an internal endpoint that Prometheus scrapes every 30 seconds. This provides visibility into request rates, error rates, response times, and connection counts.
Metrics Endpoint: http://charliehub-traefik:8082/metrics (Docker network only)
Configuration¶
Traefik Configuration (traefik.yml)¶
File: /opt/charliehub/traefik/config/traefik.yml
metrics:
  prometheus:
    addEntryPointsLabels: true    # Monitor entrypoints only, not individual routers
    addRoutersLabels: false       # Prevents high cardinality
    addServicesLabels: false
    entryPoint: metrics
    manualRouting: false

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
  metrics:
    address: ":8082"              # Not published to the host; Docker network access only
Prometheus Configuration¶
File: /opt/charliehub/monitoring/prometheus/targets/hub2-apps.yml
- targets: ['charliehub-traefik:8082']
  labels:
    instance: 'traefik'
    app: 'traefik'
    site: 'ovh'
Prometheus Scrape Settings¶
File: /opt/charliehub/monitoring/prometheus/prometheus.yml
- job_name: 'hub2-apps'
  metrics_path: /metrics
  scrape_interval: 30s    # Less frequent than default (15s)
  scrape_timeout: 10s     # Timeout if Traefik doesn't respond
  file_sd_configs:
    - files: ['/etc/prometheus/targets/hub2-apps.yml']
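After changing the scrape job, Prometheus must reload its configuration before the Traefik target appears (file_sd target files are re-read automatically, but prometheus.yml itself is not). A minimal check, assuming the lifecycle API is enabled with `--web.enable-lifecycle` (not confirmed by this guide):

```bash
# Reload Prometheus configuration without a restart (requires --web.enable-lifecycle)
curl -s -X POST http://localhost:9090/-/reload

# Confirm the hub2-apps job now includes the Traefik target
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.scrapePool=="hub2-apps") | .scrapeUrl'
```

If the lifecycle endpoint is disabled, `docker compose restart prometheus` achieves the same result.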
Traefik Metrics¶
Available Metrics¶
Traefik exposes the following Prometheus metrics:
Configuration Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `traefik_config_last_reload_success` | Gauge | Last reload success (1 = success, 0 = failure) | (none) |
| `traefik_config_last_reload_success_timestamp_seconds` | Gauge | Timestamp of the last successful config reload | (none) |
Request Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `traefik_entrypoint_requests_total` | Counter | Total HTTP requests received | entrypoint, method, code |
| `traefik_entrypoint_request_duration_seconds` | Histogram | HTTP request duration in seconds | entrypoint, method, code |
| `traefik_entrypoint_open_connections` | Gauge | Current open connections | entrypoint |
Example Queries¶
# Total requests by entrypoint
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)
# Error rate (5xx percentage)
sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
/ sum(rate(traefik_entrypoint_requests_total[5m]))
# P95 response time
histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
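To spot-check any of these expressions outside Grafana, they can be run against the Prometheus HTTP API; a minimal sketch, assuming Prometheus is reachable on localhost:9090 as elsewhere in this guide:

```bash
# Per-entrypoint request rate over the last 5 minutes, via the Prometheus HTTP API
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)' \
  | jq -r '.data.result[] | "\(.metric.entrypoint): \(.value[1]) req/s"'
```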
Alert Rules¶
Traefik Alert Rules¶
File: /opt/charliehub/monitoring/prometheus/rules/traefik-alerts.yml
TraefikMetricsDown¶
- Expression: `up{instance="traefik"} == 0`
- Duration: 2 minutes
- Severity: WARNING
- Meaning: Traefik metrics endpoint is unreachable
- Action:

docker compose restart traefik
docker compose ps traefik
TraefikHighErrorRate¶
- Expression: `sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_entrypoint_requests_total[5m])) > 0.02`
- Duration: 5 minutes
- Severity: CRITICAL
- Meaning: More than 2% of requests are returning 5xx errors
- Action:

# Check for backend service failures
docker ps | grep -E "prometheus|grafana|alertmanager"
docker compose logs traefik | grep -i "error\|5[0-9][0-9]" | tail -20
TraefikSlowResponseTime¶
- Expression: `histogram_quantile(0.99, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le)) > 2`
- Duration: 5 minutes
- Severity: WARNING
- Meaning: 99th percentile response time is over 2 seconds
- Action:

# Check database performance
docker compose logs charliehub-postgres | grep -i "slow\|duration" | tail -10
# Check backend service logs
docker compose logs prometheus grafana alertmanager | grep -i "slow\|timeout"
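After editing the rules file, it is worth validating the syntax before Prometheus loads it. A minimal sketch, assuming promtool is available in the Prometheus container and the rules directory is mounted at /etc/prometheus/rules (the container-side path is an assumption; only the host path above is documented):

```bash
# Validate alert rule syntax inside the Prometheus container (container path is an assumption)
docker compose exec prometheus promtool check rules /etc/prometheus/rules/traefik-alerts.yml

# Reload Prometheus so the updated rules take effect (requires --web.enable-lifecycle)
curl -s -X POST http://localhost:9090/-/reload
```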
Monitoring Dashboards¶
Importing Grafana Dashboard¶
Official Traefik Dashboard (Dashboard ID: 17346)
1. Access Grafana: https://grafana.charliehub.net
2. Import the dashboard:
    - Click: Dashboards → Import
    - Enter ID: `17346`
    - Select Data Source: Prometheus
3. Click: Import
4. Verify metrics:
    - The dashboard should show request rates
    - If it shows "No Data", verify:
        - Prometheus is scraping Traefik: `curl -s http://localhost:9090/api/v1/targets | grep traefik`
        - Traefik is receiving requests: `curl https://grafana.charliehub.net` should create metrics
Custom Queries for Grafana¶
Request Rate (req/sec)¶
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)
Error Rate Percentage¶
sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) by (entrypoint)
/ sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint) * 100
Response Time Percentiles¶
# P50 (median)
histogram_quantile(0.50, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
# P95
histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
# P99
histogram_quantile(0.99, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
Open Connections¶
sum(traefik_entrypoint_open_connections) by (entrypoint)
Troubleshooting¶
Traefik Metrics Endpoint Down¶
Symptom: Alert TraefikMetricsDown firing
Diagnosis:
# 1. Check if Traefik container is running
docker compose ps traefik
# 2. Check if metrics port is listening
docker exec charliehub-traefik netstat -tlnp | grep 8082
# 3. Try to access the metrics endpoint (port 8082 is not published to the host,
#    so this must run from a container on the Docker network)
docker compose exec prometheus wget -qO- http://charliehub-traefik:8082/metrics | head -20
# 4. Check Traefik logs
docker compose logs traefik | tail -50
Fixes:
# Restart Traefik
docker compose restart traefik
# If still failing, check configuration
docker exec charliehub-traefik cat /traefik/traefik.yml | grep -A 10 "metrics:"
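If the Traefik container itself lacks the tooling to probe the endpoint, a throwaway curl container attached to the same Docker network works as well; a minimal sketch (the network name charliehub_default is an assumption, check `docker network ls` for the real one):

```bash
# Probe the metrics endpoint from a disposable container on the Traefik network
# NOTE: "charliehub_default" is an assumed network name; adjust to your compose project
docker run --rm --network charliehub_default curlimages/curl:latest \
  -s http://charliehub-traefik:8082/metrics | head -20
```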
No Metrics Data in Grafana¶
Symptom: Dashboard shows "No Data" or blank panels
Diagnosis:
# 1. Check Prometheus is scraping Traefik
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance=="traefik")'
# Expected: health: "up"
# 2. Check if metrics exist
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result | length'
# Expected: > 0
# 3. Generate some traffic to create metrics
for i in {1..10}; do
curl -s https://grafana.charliehub.net >/dev/null 2>&1
done
# 4. Wait 30 seconds (scrape interval) and check again
sleep 30
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result'
Fixes:
# 1. If target is DOWN, restart Traefik
docker compose restart traefik
# 2. If target is UP but no metrics, check if Traefik received requests
# (Metrics only appear after first request is processed)
# 3. Generate test requests
curl https://prometheus.charliehub.net/api/v1/status/flags
# 4. Wait and verify metrics appear
sleep 30
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result[0]'
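When the target reports as up but panels stay empty, the scrape error field usually tells the story; a small check using the same targets API as above:

```bash
# Show scrape health and the most recent scrape error for the traefik target
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.labels.instance=="traefik") | "\(.health): \(.lastError)"'
```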
High Error Rate Alert¶
Symptom: Alert TraefikHighErrorRate firing
Diagnosis:
# 1. Check which backend services are having issues
docker ps | grep -E "prometheus|grafana|alertmanager"
docker compose ps
# 2. Check service logs
docker compose logs prometheus --tail 50 | grep -i "error"
docker compose logs grafana --tail 50 | grep -i "error"
# 3. Check database connectivity
docker compose logs charliehub-postgres --tail 20
docker compose logs charliehub_redis --tail 20
# 4. Check Traefik logs for routing errors
docker compose logs traefik | grep -i "error\|backend" | tail -20
Fixes:
# Restart failing service
docker compose restart prometheus grafana
# Or restart all monitoring services
docker compose restart prometheus alertmanager grafana
# Monitor error rate
watch -n 5 'curl -s -G http://localhost:9090/api/v1/query --data-urlencode "query=sum(rate(traefik_entrypoint_requests_total{code=~\"5..\"}[5m]))/sum(rate(traefik_entrypoint_requests_total[5m]))" | jq ".data.result[0].value[1]"'
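To see which status code dominates before restarting anything, the 5xx rate can be broken down by code; a small sketch using the same query API:

```bash
# Break the 5xx rate down by status code to see which error dominates
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) by (code)' \
  | jq -r '.data.result[] | "\(.metric.code): \(.value[1]) req/s"'
```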
Slow Response Times¶
Symptom: Alert TraefikSlowResponseTime firing
Diagnosis:
# 1. Check database performance
docker exec charliehub_postgres psql -U postgres -d charliehub -c "SELECT * FROM pg_stat_statements WHERE mean_exec_time > 1000 LIMIT 10;"
# 2. Check resource usage
docker stats --no-stream | grep -E "charliehub|prometheus|grafana"
# 3. Check if any service is running slow operations
docker compose logs prometheus | grep -i "slow\|duration" | tail -10
# 4. Profile slow endpoints
curl -w "Total: %{time_total}s, Connect: %{time_connect}s, Transfer: %{time_starttransfer}s\n" -o /dev/null -s https://grafana.charliehub.net/api/health
Fixes:
# 1. Optimize database queries (see PostgreSQL docs)
# 2. Check available memory before raising container resource limits
docker compose exec prometheus sh -c 'free -h'
# 3. Reduce data retention if Prometheus is slow
#    (retention is set via Prometheus command-line flags, e.g. in docker-compose.yml, not prometheus.yml)
#    --storage.tsdb.retention.time=15d
#    --storage.tsdb.retention.size=10GB
docker compose restart prometheus
# 4. Check for background backup/scrub operations
ps aux | grep -E "backup|scrub"
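To judge whether the slowness is broad or confined to the tail, comparing the median against the 99th percentile helps; a minimal sketch using the query API:

```bash
# Compare median and tail latency; a large gap points at a few slow requests,
# a small gap points at general slowness
for q in 0.50 0.99; do
  printf 'quantile %s: ' "$q"
  curl -s -G http://localhost:9090/api/v1/query \
    --data-urlencode "query=histogram_quantile($q, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))" \
    | jq -r '.data.result[0].value[1] // "no data"'
done
```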
Performance Considerations¶
Cardinality Impact¶
The current Traefik configuration is safe for cardinality:
- Metrics: ~15
- Labels per metric: entrypoint only (2 values)
- Unique label combinations: 2-5 (web, websecure)
- Total time series: ~30 (very small)
Why we disabled service/router labels:
If enabled:
- Routers: ~35 unique routers
- Services: ~29 unique services
- Combined: 35 × 29 × 2 (methods) × 10 (status codes) = 20,300 time series
- Impact: 4-6 GB Prometheus memory, slow queries
Disabled (current):
- Only entrypoint labels (2 values)
- Total: ~30 time series
- Impact: <50 MB memory
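To confirm the estimate in practice, the number of Traefik time series Prometheus is actually storing can be counted with the series API; a small sketch:

```bash
# Count the Traefik time series currently held by Prometheus (expect roughly 30)
curl -s -G http://localhost:9090/api/v1/series \
  --data-urlencode 'match[]={__name__=~"traefik_.*"}' \
  | jq '.data | length'
```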
Scrape Interval¶
- Current: 30 seconds
- Default: 15 seconds
- Reason: Reduced to minimize overhead; 30s is sufficient for alert detection (2+ minute delays acceptable)
Metrics Validation¶
Verify Traefik Metrics Available¶
# 1. Check raw metrics endpoint (only reachable from the Docker network, not the host)
curl -s http://charliehub-traefik:8082/metrics | head -50
# Expected output includes:
# # HELP traefik_config_last_reload_success Last config reload success
# # TYPE traefik_config_last_reload_success gauge
# traefik_config_last_reload_success 1
Verify Prometheus Scraping¶
# Check target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance=="traefik")'
# Expected:
# {
# "discoveredLabels": {...},
# "labels": {
# "instance": "traefik",
# "app": "traefik",
# "site": "ovh"
# },
# "scrapeUrl": "http://charliehub-traefik:8082/metrics",
# "lastError": "",
# "lastScrape": "2026-02-06T18:00:00Z",
# "lastScrapeDuration": 0.012,
# "health": "up"
# }
Generate Test Metrics¶
# Make requests to create metrics
for i in {1..100}; do
curl -s https://prometheus.charliehub.net >/dev/null &
done
wait
# Wait 30 seconds for scrape
sleep 30
# Query metrics
curl -s 'http://localhost:9090/api/v1/query?query=traefik_entrypoint_requests_total' | jq '.data.result[] | {metric, value: .value[1]}'
Best Practices¶
Monitoring Strategy¶
✅ Do:

- Monitor request rate trends
- Alert on error rate spikes
- Track response time percentiles
- Monitor for config reload failures

❌ Don't:

- Monitor individual service metrics (causes high cardinality)
- Set error rate thresholds too low (causes alert fatigue)
- Scrape more frequently than needed (the 15s default is already frequent)
Alert Tuning¶
- High Error Rate: threshold at 2% of requests matching `code=~"5.."`
    - Lower = alert fatigue
    - Higher = slower incident detection
- Slow Response: threshold at 2 seconds (P99)
    - Lower = might trigger for normal queries
    - Higher = misses genuine slowness
Maintenance¶
# Weekly: Check alert rules are evaluating
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="traefik_health") | {name, rules: (.rules | length)}'
# Monthly: Review dashboard for trends
# - Are error rates increasing?
# - Are response times getting slower?
# - Any patterns in traffic?
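For the monthly trend review, the same questions can be answered ad hoc from the query_range API rather than the dashboard; a minimal sketch (assumes GNU date on the host):

```bash
# 7-day error-rate trend, one sample per day, to spot gradual degradation
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[1d]))/sum(rate(traefik_entrypoint_requests_total[1d]))' \
  --data-urlencode "start=$(date -d '7 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=86400' \
  | jq '.data.result[0].values'
```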
Last Updated: 2026-02-06
Status: Production Ready