Emergency Procedures

What to do when things go wrong.

API is Down, Database is Up

Symptom: Can't reach the API, but PostgreSQL is accessible

Safe approach: Modify database directly (documented)

# Step 1: Verify what's wrong
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "SELECT domain, status FROM domains WHERE status='active' LIMIT 5"

# Step 2: Make the fix in the database
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "UPDATE domains SET status='active' WHERE domain='critical-service.com'"

# Step 3: Regenerate configuration
docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py

# Step 4: Verify Traefik reloaded
curl -s http://localhost:8091/api/http/routers | jq '.[] | select(.name | contains("critical"))'

# Step 5: Document the incident
echo "$(date): Emergency fix - updated domains via SQL due to API downtime" >> /var/log/emergency.log

Key: Database is the source of truth. Fixing there is safe.
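Step 5 can be wrapped in a small helper so every log entry has the same shape. This is a sketch only: log_emergency is a hypothetical name, and the UTC timestamp and user= field are assumptions, not an existing CharlieHub log format.

```shell
# log_emergency LOGFILE MESSAGE...
# Appends one line: UTC timestamp, invoking user, then the message.
log_emergency() {
  local logfile="$1"
  shift
  printf '%s user=%s %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER:-unknown}" "$*" >> "$logfile"
}

# Example (log path taken from Step 5 above):
# log_emergency /var/log/emergency.log "Emergency fix - updated domains via SQL due to API downtime"
```

Recording the user alongside the timestamp makes it easier to follow up on a direct database change during the post-incident review.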


Configuration Got Corrupted

Symptom: Routes are wrong, config is inconsistent

Recovery: Restore from snapshot

# CharlieHub takes automatic snapshots; list the five most recent
ls -t /opt/charliehub/traefik/config/history/ | head -5

# Find the last good snapshot (before corruption)
# Restore it
sudo cp /opt/charliehub/traefik/config/history/2026-02-12_15-01-12.yml \
        /opt/charliehub/traefik/config/generated/routes.yml

# Verify Traefik reloaded with the good config (prints the number of HTTP routers)
curl -s http://localhost:8091/api/http/routers | jq length

Then:

# Figure out what went wrong with the database
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "SELECT * FROM domains WHERE status='active' ORDER BY updated_at DESC LIMIT 10"

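# Fix the database issue through the API (set API_URL to the API's base URL)
curl -X PUT "$API_URL/api/domains/PROBLEM_ID" \
  -H 'Content-Type: application/json' -d '{...}'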

# Regenerate
docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py
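Picking the last good snapshot by hand is error-prone under pressure. Here is a minimal sketch, assuming snapshot files are named like the example above (YYYY-MM-DD_HH-MM-SS.yml) so that names sort chronologically. latest_snapshot_before is a hypothetical helper, not part of CharlieHub.

```shell
# latest_snapshot_before DIR CUTOFF
# Prints the newest snapshot in DIR whose name sorts before CUTOFF
# (CUTOFF uses the same YYYY-MM-DD_HH-MM-SS format). Returns 1 if none.
latest_snapshot_before() {
  local dir="$1" cutoff="$2" best="" f name
  for f in "$dir"/*.yml; do
    [ -e "$f" ] || continue              # glob matched nothing
    name=$(basename "$f" .yml)
    # Keep the file if it predates the cutoff and is newer than the current best.
    if LC_ALL=C expr "x$name" \< "x$cutoff" >/dev/null &&
       LC_ALL=C expr "x$name" \> "x$best" >/dev/null; then
      best=$name
    fi
  done
  [ -n "$best" ] && printf '%s/%s.yml\n' "$dir" "$best"
}

# Example: newest snapshot taken before 16:00 on the day of the incident
# latest_snapshot_before /opt/charliehub/traefik/config/history 2026-02-12_16-00-00
```

The comparison is purely lexical, so it only works while every snapshot name follows the same fixed-width timestamp format.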


Someone Made Direct YAML Edits

Symptom: routes.yml was modified directly, and the changes are not reflected in the database

Recovery:

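# Step 1: Save the manual edits for review, then restore the generated file
# (run from inside the repository that tracks routes.yml; the /tmp path is only an example)
git diff -- /opt/charliehub/traefik/config/generated/routes.yml > /tmp/routes-manual-edit.diff
git checkout -- /opt/charliehub/traefik/config/generated/routes.yml

# Step 2: Figure out what they were trying to do
# Read the saved diff for uncommitted edits; check recent commits for committed ones
git log --oneline -5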

# Step 3: Do it the right way
# Ask: "What change were you trying to make?"
# Answer: Use the API

# Step 4: Regenerate
docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py
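One way to notice hand edits early is to record a checksum right after each generator run and compare against it later. This is a sketch under that assumption: record_checksum and is_unmodified are hypothetical helpers, and where the checksum file lives is up to you.

```shell
# record_checksum FILE SUMFILE
# Stores FILE's checksum in SUMFILE (run right after the generator).
record_checksum() {
  cksum < "$1" > "$2"
}

# is_unmodified FILE SUMFILE
# Succeeds only if FILE still matches the stored checksum.
is_unmodified() {
  [ -r "$2" ] && [ "$(cksum < "$1")" = "$(cat "$2")" ]
}

# Example (the checksum path is hypothetical):
# is_unmodified /opt/charliehub/traefik/config/generated/routes.yml /var/lib/charliehub/routes.cksum \
#   || echo "routes.yml was edited by hand since the last generation" >&2
```

cksum detects accidental or casual edits. It is not a security control against someone who also rewrites the checksum file.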

Someone Inserted Directly into Database

Symptom: Route appears in database but not in Traefik

Recovery:

# Step 1: Find the bad row
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "SELECT * FROM domains WHERE domain='unknown-domain.com'"

# Step 2: Check if it's valid (constraints check)
# If INSERT succeeded despite being invalid, constraints are broken

# Step 3: Delete the bad row
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "DELETE FROM domains WHERE domain='unknown-domain.com'"

# Step 4: Verify constraints are working
# Try to insert an invalid row; the ROLLBACK discards it if the insert unexpectedly succeeds:
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "BEGIN; INSERT INTO domains (protocol, cors_enabled, ...) VALUES ('tcp', true, ...); ROLLBACK;"
# Should get: ERROR: new row violates check constraint

# Step 5: Educate the user
# "Use the API, it validates this stuff"
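The symptom above (a row in the database with no matching route) can be checked mechanically by comparing the active domains against Traefik's router rules. This is a sketch assuming rules quote hostnames in backticks, as in Host(`example.com`). missing_routes is a hypothetical helper.

```shell
# missing_routes DOMAINS_FILE RULES_FILE
# Prints every domain from DOMAINS_FILE (one per line) that does not
# appear backtick-quoted in any line of RULES_FILE.
missing_routes() {
  local domains="$1" rules="$2" d
  while IFS= read -r d; do
    [ -n "$d" ] || continue
    grep -qF -- "\`$d\`" "$rules" || printf '%s\n' "$d"
  done < "$domains"
}

# Producing the two inputs (container name and port as used elsewhere on this page):
# docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
#   -At -c "SELECT domain FROM domains WHERE status='active'" > /tmp/domains.txt
# curl -s http://localhost:8091/api/http/routers | jq -r '.[].rule' > /tmp/rules.txt
# missing_routes /tmp/domains.txt /tmp/rules.txt
```

Matching on the backtick-quoted name avoids treating b.example.com as present just because xb.example.com has a route.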

Traefik Won't Start

Symptom: docker logs charliehub-traefik shows errors

Common causes:

Bad YAML in routes.yml

# Validate the YAML (yq exits non-zero if the file does not parse)
docker run --rm -v /opt/charliehub/traefik/config/generated:/data:ro \
  mikefarah/yq eval '.' /data/routes.yml

# If it fails, restore snapshot
sudo cp /opt/charliehub/traefik/config/history/LAST_GOOD.yml \
        /opt/charliehub/traefik/config/generated/routes.yml

# Restart Traefik
docker restart charliehub-traefik
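Restoring a snapshot that is itself broken only repeats the outage, so it can help to validate the candidate before copying it into place. This is a minimal sketch: install_if_valid is a hypothetical helper, and the validator is any command that accepts the file path as its last argument (for example a locally installed yq, which is an assumption about the host).

```shell
# install_if_valid CANDIDATE TARGET VALIDATOR [ARGS...]
# Runs VALIDATOR [ARGS...] CANDIDATE; only if it succeeds, backs up TARGET
# to TARGET.bak and copies CANDIDATE over TARGET.
install_if_valid() {
  local candidate="$1" target="$2"
  shift 2
  if ! "$@" "$candidate"; then
    echo "validation failed, leaving $target untouched" >&2
    return 1
  fi
  [ -e "$target" ] && cp -p "$target" "$target.bak"
  cp "$candidate" "$target"
}

# Example, assuming yq is installed on the host:
# sudo install_if_valid is not possible for a shell function, so run it from a root shell:
# install_if_valid /opt/charliehub/traefik/config/history/LAST_GOOD.yml \
#   /opt/charliehub/traefik/config/generated/routes.yml yq eval '.'
```

The backup is written next to the target. Move it elsewhere if extra files in the generated directory are a concern in your setup.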

Configuration Constraint Violation

# Check the generated file for invalid configs
grep -A 5 "ERROR" /opt/charliehub/traefik/config/generated/routes.yml

# If the generator created bad config:
# 1. Find the bad domain entry
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "SELECT domain, status FROM domains WHERE status='active'"

# 2. Fix it via API (set API_URL to the API's base URL)
curl -X PUT "$API_URL/api/domains/ID" \
  -H 'Content-Type: application/json' -d '{...}'

# 3. Regenerate
docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py

# 4. Restart Traefik
docker restart charliehub-traefik

Database Constraints Are Too Strict

Symptom: Valid-seeming config gets rejected by constraint

Options:

Option 1: The Constraint is Right, Config is Wrong

# Database constraint: TCP routes must have backend_host + backend_port
# You're trying: protocol='tcp' without backend_host

# This is correct behavior - your config is invalid
# Fix: Add backend_host, or change to protocol='http'

Option 2: The Constraint is Wrong

# The constraint is preventing legitimate config
# Solution: Update the constraint

# Step 1: Review the constraint
docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains \
  -c "SELECT conname, pg_get_constraintdef(oid) FROM pg_constraint WHERE conrelid='domains'::regclass AND contype='c'"

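# Step 2: Modify it (run inside psql, in one transaction so the table
# is never left without the constraint)
BEGIN;
ALTER TABLE domains DROP CONSTRAINT bad_constraint;
ALTER TABLE domains ADD CONSTRAINT new_constraint CHECK (...);
COMMIT;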

# Step 3: Document why you changed it
git commit -m "fix(database): Relaxed constraint X because Y"

# Step 4: Update standards documentation
# Edit /docs/standards/ and explain the new rule
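Because ADD CONSTRAINT ... CHECK validates existing rows, a dry run that ends in ROLLBACK shows whether current data satisfies the new rule before anything is changed. This is a sketch that only generates the SQL. constraint_swap_sql is a hypothetical helper, and the CHECK expression in the example is illustrative, not the real CharlieHub constraint.

```shell
# constraint_swap_sql TABLE OLD_NAME NEW_NAME CHECK_EXPR [COMMIT|ROLLBACK]
# Prints a transaction that replaces one check constraint with another.
# Defaults to ROLLBACK, so piping the output to psql is a dry run.
constraint_swap_sql() {
  local table="$1" old="$2" new="$3" check="$4" finish="${5:-ROLLBACK}"
  printf 'BEGIN;\n'
  printf 'ALTER TABLE %s DROP CONSTRAINT %s;\n' "$table" "$old"
  printf 'ALTER TABLE %s ADD CONSTRAINT %s CHECK (%s);\n' "$table" "$new" "$check"
  printf '%s;\n' "$finish"
}

# Dry run (illustrative CHECK expression; ON_ERROR_STOP makes psql stop at the first error):
# constraint_swap_sql domains bad_constraint new_constraint \
#   "protocol <> 'tcp' OR backend_host IS NOT NULL" \
#   | docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains -v ON_ERROR_STOP=1
# Pass COMMIT as the fifth argument to apply the change for real.
```

The arguments are interpolated into SQL as-is, so only pass values you typed yourself.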

Incident Checklist

When something goes wrong:

  • [ ] Immediate: Identify symptoms
  • [ ] Assess: Is API down? Database? Traefik?
  • [ ] Stabilize: Restore from snapshot if needed
  • [ ] Investigate: What went wrong?
  • [ ] Fix: Make change at the source (database/API)
  • [ ] Verify: Test that it works
  • [ ] Prevent: How to prevent this in future?
  • [ ] Document: Log what happened and why
  • [ ] Post-mortem: Why did the safeguards not catch this?

Post-Incident Questions

After any incident, ask:

  1. Was this prevented by a safeguard? (Constraint, API validation, doc)
     If no → Add the missing safeguard

  2. Was this caught early? (Git hook, pre-commit, log monitoring)
     If no → Add monitoring

  3. Could it happen again? (Same root cause)
     If yes → Fix the root cause, not the symptom

  4. Did we follow the standards? (10 commandments)
     If no → Reinforce training on standards

When to Escalate

Contact your team lead if:

  • You had to modify the database directly
  • A constraint needed to be changed
  • The API doesn't support what you need (needs extension)
  • The documentation is unclear or contradictory
  • You're about to bypass the system with a workaround

Don't: Just make a workaround and move on.

Do: Escalate to get it extended properly.


Key Principles

  1. Database is source of truth - Fix problems there
  2. API validates changes - Use it for normal operations
  3. Generate from source - Never edit output files
  4. Snapshots are backups - Know how to restore
  5. Constraints prevent corruption - Work with them, not around them
  6. Document incidents - Learn from what went wrong