Traefik Metrics Monitoring¶
Complete guide to Traefik reverse proxy metrics monitoring and troubleshooting.
Overview¶
Traefik (the reverse proxy) exposes Prometheus metrics on an internal endpoint that Prometheus scrapes every 30 seconds. This provides visibility into request rates, error rates, response times, and connection counts.
Metrics Endpoint: http://charliehub-traefik:8082/metrics (Docker network only)
Configuration¶
Traefik Configuration (traefik.yml)¶
File: /opt/charliehub/traefik/config/traefik.yml
metrics:
  prometheus:
    addEntryPointsLabels: true    # Monitor entrypoints only, not individual routers
    addRoutersLabels: false       # Prevents high cardinality
    addServicesLabels: false
    entryPoint: metrics
    manualRouting: false

entryPoints:
  web:
    address: ":80"
  websecure:
    address: ":443"
  metrics:
    address: ":8082"              # Not published to the host; Docker network access only
Prometheus Configuration¶
File: /opt/charliehub/monitoring/prometheus/targets/hub2-apps.yml
- targets: ['charliehub-traefik:8082']
  labels:
    instance: 'traefik'
    app: 'traefik'
    site: 'ovh'
Prometheus Scrape Settings¶
File: /opt/charliehub/monitoring/prometheus/prometheus.yml
- job_name: 'hub2-apps'
  metrics_path: /metrics
  scrape_interval: 30s    # Less frequent than default (15s)
  scrape_timeout: 10s     # Timeout if Traefik doesn't respond
  file_sd_configs:
    - files: ['/etc/prometheus/targets/hub2-apps.yml']
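After changing the scrape job, Prometheus must reload its configuration before the Traefik target appears (file_sd target files are re-read automatically, but prometheus.yml itself is not). A minimal check, assuming the lifecycle API is enabled with `--web.enable-lifecycle` (not confirmed by this guide):

```bash
# Reload Prometheus configuration without a restart (requires --web.enable-lifecycle)
curl -s -X POST http://localhost:9090/-/reload

# Confirm the hub2-apps job now includes the Traefik target
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.scrapePool=="hub2-apps") | .scrapeUrl'
```

If the lifecycle endpoint is disabled, `docker compose restart prometheus` achieves the same result.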
Traefik Metrics¶
Available Metrics¶
Traefik exposes the following Prometheus metrics:
Configuration Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `traefik_config_last_reload_success` | Gauge | Last reload success (1 = success, 0 = failure) | (none) |
| `traefik_config_last_reload_success_timestamp_seconds` | Gauge | Timestamp of the last successful config reload | (none) |
Request Metrics¶
| Metric | Type | Description | Labels |
|---|---|---|---|
| `traefik_entrypoint_requests_total` | Counter | Total HTTP requests received | entrypoint, method, code |
| `traefik_entrypoint_request_duration_seconds` | Histogram | HTTP request duration in seconds | entrypoint, method, code |
| `traefik_entrypoint_open_connections` | Gauge | Current open connections | entrypoint |
Example Queries¶
# Total requests by entrypoint
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)
# Error rate (5xx percentage)
sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m]))
/ sum(rate(traefik_entrypoint_requests_total[5m]))
# P95 response time
histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
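To spot-check any of these expressions outside Grafana, they can be run against the Prometheus HTTP API; a minimal sketch, assuming Prometheus is reachable on localhost:9090 as elsewhere in this guide:

```bash
# Per-entrypoint request rate over the last 5 minutes, via the Prometheus HTTP API
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)' \
  | jq -r '.data.result[] | "\(.metric.entrypoint): \(.value[1]) req/s"'
```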
Alert Rules¶
Traefik Alert Rules¶
File: /opt/charliehub/monitoring/prometheus/rules/traefik-alerts.yml
TraefikMetricsDown¶
- Expression: `up{instance="traefik"} == 0`
- Duration: 2 minutes
- Severity: WARNING
- Meaning: Traefik metrics endpoint is unreachable
- Action:

docker compose restart traefik
docker compose ps traefik
TraefikHighErrorRate¶
- Expression: `sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) / sum(rate(traefik_entrypoint_requests_total[5m])) > 0.02`
- Duration: 5 minutes
- Severity: CRITICAL
- Meaning: More than 2% of requests are returning 5xx errors
- Action:

# Check for backend service failures
docker ps | grep -E "prometheus|grafana|alertmanager"
docker compose logs traefik | grep -i "error\|5[0-9][0-9]" | tail -20
TraefikSlowResponseTime¶
- Expression: `histogram_quantile(0.99, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le)) > 2`
- Duration: 5 minutes
- Severity: WARNING
- Meaning: 99th percentile response time is over 2 seconds
- Action:

# Check database performance
docker compose logs charliehub-postgres | grep -i "slow\|duration" | tail -10
# Check backend service logs
docker compose logs prometheus grafana alertmanager | grep -i "slow\|timeout"
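After editing the rules file, it is worth validating the syntax before Prometheus loads it. A minimal sketch, assuming promtool is available in the Prometheus container and the rules directory is mounted at /etc/prometheus/rules (the container-side path is an assumption; only the host path above is documented):

```bash
# Validate alert rule syntax inside the Prometheus container (container path is an assumption)
docker compose exec prometheus promtool check rules /etc/prometheus/rules/traefik-alerts.yml

# Reload Prometheus so the updated rules take effect (requires --web.enable-lifecycle)
curl -s -X POST http://localhost:9090/-/reload
```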
Monitoring Dashboards¶
Importing Grafana Dashboard¶
Official Traefik Dashboard (Dashboard ID: 17346)
1. Access Grafana: https://grafana.charliehub.net
2. Import the dashboard:
    - Click: Dashboards → Import
    - Enter ID: `17346`
    - Select Data Source: Prometheus
3. Click: Import
4. Verify metrics:
    - The dashboard should show request rates
    - If it shows "No Data", verify:
        - Prometheus is scraping Traefik: `curl -s http://localhost:9090/api/v1/targets | grep traefik`
        - Traefik is receiving requests: `curl https://grafana.charliehub.net` should create metrics
Custom Queries for Grafana¶
Request Rate (req/sec)¶
sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint)
Error Rate Percentage¶
sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) by (entrypoint)
/ sum(rate(traefik_entrypoint_requests_total[5m])) by (entrypoint) * 100
Response Time Percentiles¶
# P50 (median)
histogram_quantile(0.50, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
# P95
histogram_quantile(0.95, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
# P99
histogram_quantile(0.99, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))
Open Connections¶
sum(traefik_entrypoint_open_connections) by (entrypoint)
Troubleshooting¶
Traefik Metrics Endpoint Down¶
Symptom: Alert TraefikMetricsDown firing
Diagnosis:
# 1. Check if Traefik container is running
docker compose ps traefik
# 2. Check if metrics port is listening
docker exec charliehub-traefik netstat -tlnp | grep 8082
# 3. Try to access the metrics endpoint (port 8082 is not published to the host,
#    so this must run from a container on the Docker network)
docker compose exec prometheus wget -qO- http://charliehub-traefik:8082/metrics | head -20
# 4. Check Traefik logs
docker compose logs traefik | tail -50
Fixes:
# Restart Traefik
docker compose restart traefik
# If still failing, check configuration
docker exec charliehub-traefik cat /traefik/traefik.yml | grep -A 10 "metrics:"
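If the Traefik container itself lacks the tooling to probe the endpoint, a throwaway curl container attached to the same Docker network works as well; a minimal sketch (the network name charliehub_default is an assumption, check `docker network ls` for the real one):

```bash
# Probe the metrics endpoint from a disposable container on the Traefik network
# NOTE: "charliehub_default" is an assumed network name; adjust to your compose project
docker run --rm --network charliehub_default curlimages/curl:latest \
  -s http://charliehub-traefik:8082/metrics | head -20
```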
No Metrics Data in Grafana¶
Symptom: Dashboard shows "No Data" or blank panels
Diagnosis:
# 1. Check Prometheus is scraping Traefik
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance=="traefik")'
# Expected: health: "up"
# 2. Check if metrics exist
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result | length'
# Expected: > 0
# 3. Generate some traffic to create metrics
for i in {1..10}; do
curl -s https://grafana.charliehub.net >/dev/null 2>&1
done
# 4. Wait 30 seconds (scrape interval) and check again
sleep 30
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result'
Fixes:
# 1. If target is DOWN, restart Traefik
docker compose restart traefik
# 2. If target is UP but no metrics, check if Traefik received requests
# (Metrics only appear after first request is processed)
# 3. Generate test requests
curl https://prometheus.charliehub.net/api/v1/status/flags
# 4. Wait and verify metrics appear
sleep 30
curl -s http://localhost:9090/api/v1/query?query='traefik_entrypoint_requests_total' | jq '.data.result[0]'
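When the target reports as up but panels stay empty, the scrape error field usually tells the story; a small check using the same targets API as above:

```bash
# Show scrape health and the most recent scrape error for the traefik target
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.labels.instance=="traefik") | "\(.health): \(.lastError)"'
```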
High Error Rate Alert¶
Symptom: Alert TraefikHighErrorRate firing
Diagnosis:
# 1. Check which backend services are having issues
docker ps | grep -E "prometheus|grafana|alertmanager"
docker compose ps
# 2. Check service logs
docker compose logs prometheus --tail 50 | grep -i "error"
docker compose logs grafana --tail 50 | grep -i "error"
# 3. Check database connectivity
docker compose logs charliehub-postgres --tail 20
docker compose logs charliehub_redis --tail 20
# 4. Check Traefik logs for routing errors
docker compose logs traefik | grep -i "error\|backend" | tail -20
Fixes:
# Restart failing service
docker compose restart prometheus grafana
# Or restart all monitoring services
docker compose restart prometheus alertmanager grafana
# Monitor error rate
watch -n 5 'curl -s -G http://localhost:9090/api/v1/query --data-urlencode "query=sum(rate(traefik_entrypoint_requests_total{code=~\"5..\"}[5m]))/sum(rate(traefik_entrypoint_requests_total[5m]))" | jq ".data.result[0].value[1]"'
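To see which status code dominates before restarting anything, the 5xx rate can be broken down by code; a small sketch using the same query API:

```bash
# Break the 5xx rate down by status code to see which error dominates
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[5m])) by (code)' \
  | jq -r '.data.result[] | "\(.metric.code): \(.value[1]) req/s"'
```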
Slow Response Times¶
Symptom: Alert TraefikSlowResponseTime firing
Diagnosis:
# 1. Check database performance
docker exec charliehub_postgres psql -U postgres -d charliehub -c "SELECT * FROM pg_stat_statements WHERE mean_exec_time > 1000 LIMIT 10;"
# 2. Check resource usage
docker stats --no-stream | grep -E "charliehub|prometheus|grafana"
# 3. Check if any service is running slow operations
docker compose logs prometheus | grep -i "slow\|duration" | tail -10
# 4. Profile slow endpoints
curl -w "Total: %{time_total}s, Connect: %{time_connect}s, Transfer: %{time_starttransfer}s\n" -o /dev/null -s https://grafana.charliehub.net/api/health
Fixes:
# 1. Optimize database queries (see PostgreSQL docs)
# 2. Check available memory before raising container resource limits
docker compose exec prometheus sh -c 'free -h'
# 3. Reduce data retention if Prometheus is slow
#    (retention is set via Prometheus command-line flags, e.g. in docker-compose.yml, not prometheus.yml)
#    --storage.tsdb.retention.time=15d
#    --storage.tsdb.retention.size=10GB
docker compose restart prometheus
# 4. Check for background backup/scrub operations
ps aux | grep -E "backup|scrub"
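To judge whether the slowness is broad or confined to the tail, comparing the median against the 99th percentile helps; a minimal sketch using the query API:

```bash
# Compare median and tail latency; a large gap points at a few slow requests,
# a small gap points at general slowness
for q in 0.50 0.99; do
  printf 'quantile %s: ' "$q"
  curl -s -G http://localhost:9090/api/v1/query \
    --data-urlencode "query=histogram_quantile($q, sum(rate(traefik_entrypoint_request_duration_seconds_bucket[5m])) by (le))" \
    | jq -r '.data.result[0].value[1] // "no data"'
done
```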
Performance Considerations¶
Cardinality Impact¶
The current Traefik configuration is safe for cardinality:
- Metrics: ~15
- Labels per metric: entrypoint only (2 values)
- Unique label combinations: 2-5 (web, websecure)
- Total time series: ~30 (very small)
Why we disabled service/router labels:
If enabled:
- Routers: ~35 unique routers
- Services: ~29 unique services
- Combined: 35 × 29 × 2 (methods) × 10 (status codes) = 20,300 time series
- Impact: 4-6 GB Prometheus memory, slow queries
Disabled (current):
- Only entrypoint labels (2 values)
- Total: ~30 time series
- Impact: <50 MB memory
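To confirm the estimate in practice, the number of Traefik time series Prometheus is actually storing can be counted with the series API; a small sketch:

```bash
# Count the Traefik time series currently held by Prometheus (expect roughly 30)
curl -s -G http://localhost:9090/api/v1/series \
  --data-urlencode 'match[]={__name__=~"traefik_.*"}' \
  | jq '.data | length'
```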
Scrape Interval¶
- Current: 30 seconds
- Default: 15 seconds
- Reason: Reduced to minimize overhead; 30s is sufficient for alert detection (2+ minute delays acceptable)
Metrics Validation¶
Verify Traefik Metrics Available¶
# 1. Check raw metrics endpoint (only reachable from the Docker network, not the host)
curl -s http://charliehub-traefik:8082/metrics | head -50
# Expected output includes:
# # HELP traefik_config_last_reload_success Last config reload success
# # TYPE traefik_config_last_reload_success gauge
# traefik_config_last_reload_success 1
Verify Prometheus Scraping¶
# Check target status
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.instance=="traefik")'
# Expected:
# {
# "discoveredLabels": {...},
# "labels": {
# "instance": "traefik",
# "app": "traefik",
# "site": "ovh"
# },
# "scrapeUrl": "http://charliehub-traefik:8082/metrics",
# "lastError": "",
# "lastScrape": "2026-02-06T18:00:00Z",
# "lastScrapeDuration": 0.012,
# "health": "up"
# }
Generate Test Metrics¶
# Make requests to create metrics
for i in {1..100}; do
curl -s https://prometheus.charliehub.net >/dev/null &
done
wait
# Wait 30 seconds for scrape
sleep 30
# Query metrics
curl -s 'http://localhost:9090/api/v1/query?query=traefik_entrypoint_requests_total' | jq '.data.result[] | {metric, value: .value[1]}'
Best Practices¶
Monitoring Strategy¶
✅ Do:

- Monitor request rate trends
- Alert on error rate spikes
- Track response time percentiles
- Monitor for config reload failures

❌ Don't:

- Monitor individual service metrics (causes high cardinality)
- Set error rate thresholds too low (causes alert fatigue)
- Scrape more frequently than needed (the 15s default is already frequent)
Alert Tuning¶
- High Error Rate: threshold at 2% of requests matching `code=~"5.."`
    - Lower = alert fatigue
    - Higher = slower incident detection
- Slow Response: threshold at 2 seconds (P99)
    - Lower = might trigger for normal queries
    - Higher = misses genuine slowness
Maintenance¶
# Weekly: Check alert rules are evaluating
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | select(.name=="traefik_health") | {name, rules: (.rules | length)}'
# Monthly: Review dashboard for trends
# - Are error rates increasing?
# - Are response times getting slower?
# - Any patterns in traffic?
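For the monthly trend review, the same questions can be answered ad hoc from the query_range API rather than the dashboard; a minimal sketch (assumes GNU date on the host):

```bash
# 7-day error-rate trend, one sample per day, to spot gradual degradation
curl -s -G http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=sum(rate(traefik_entrypoint_requests_total{code=~"5.."}[1d]))/sum(rate(traefik_entrypoint_requests_total[1d]))' \
  --data-urlencode "start=$(date -d '7 days ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=86400' \
  | jq '.data.result[0].values'
```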
Last Updated: 2026-02-06
Status: Production Ready