CharlieHub Infrastructure Architecture (2026)
Last Updated: February 12, 2026
Status: Production - Fully Hardened
Architecture: Unified Single-Traefik ACME-Only Model
Executive Summary
CharlieHub is a distributed infrastructure spanning three geographic locations (UK homelab, France homelab, OVH datacenter) with a central hub (hub2) managing public-facing services, internal operations, and site-to-site VPN connectivity. The architecture emphasizes security hardening, high availability, drift-proof control planes, and unified routing.
Architecture Diagram
                    ┌─────────────────────────────┐
                    │   Internet / Public Users   │
                    └──────────────┬──────────────┘
                                   │ HTTPS :443
                                   ▼
                    ╔═════════════════════════════╗
                    ║    HUB2 (OVH Dedicated)     ║
                    ║        51.68.235.106        ║
                    ╚═════════════════════════════╝
                                   │
          ┌────────────────────────┼────────────────────────┐
          │                        │                        │
          ▼                        ▼                        ▼
╔══════════════════════╗ ╔═══════════════════╗ ╔══════════════════╗
║  UNIFIED TRAEFIK     ║ ║  CONTROL PLANE    ║ ║  OPERATIONAL     ║
║  (Single Instance)   ║ ║  (PostgreSQL)     ║ ║  SERVICES        ║
║  0.0.0.0:80,443      ║ ║  Domain Manager   ║ ║                  ║
║  0.0.0.0:8883 (MQTT) ║ ║  API              ║ ║  Postgres        ║
║  8091 (dashboard)    ║ ║                   ║ ║  Redis           ║
║                      ║ ║  Route Generation ║ ║  Prometheus      ║
║  ACME certs (~23)    ║ ║  (traefik_gen.py) ║ ║  Authelia        ║
║  ~23 HTTP routes     ║ ║                   ║ ║  Domain Mgr      ║
║  1 TCP route (MQTT)  ║ ║  Drift-proof      ║ ║  FileBrowser     ║
║                      ║ ║  validation       ║ ║  UniFi API       ║
║  Domain Manager API  ║ ║                   ║ ║                  ║
║  driven routing      ║ ║  /api/domains/*   ║ ║  (Container      ║
║                      ║ ║  + validation     ║ ║  Protected)      ║
║  Authelia SSO        ║ ║                   ║ ║                  ║
╚══════════════════════╝ ╚═══════════════════╝ ╚══════════════════╝
          │                        │                        │
          └────────────────────────┼────────────────────────┘
                                   │
                     ┌─────────────┴─────────────┐
                     │                           │
                     ▼ WireGuard VPN             ▼ WireGuard VPN
         ┌──────────────────────┐     ┌─────────────────────┐
         │ UK Site (10.44.x.x)  │     │ FR Site (10.35.x.x) │
         │ px1, px2, px3        │     │ px5-lemans (DR)     │
         │ Proxmox Cluster      │     │                     │
         │ Ceph Storage (3x)    │     │ QDevice Quorum      │
         └──────────────────────┘     └─────────────────────┘
Core Components
1. Unified Traefik (Single Instance - ACME Only)
Location: hub2 (OVH Datacenter)
Instances: 1 (consolidated from previous dual-instance model)
Container: charliehub-traefik (Traefik v3)
Ports: 80 (HTTP redirect), 443 (HTTPS), 8883 (MQTT TCP), 8091 (dashboard)
Certificates: ACME production (Let's Encrypt) - ~23 active certificates
Architecture: Single-writer (drift-proof), file-based routes
Characteristics:
- ✅ ACME certificate management (auto-renewal)
- ✅ Public DNS routing (.charliehub.net, .trevarn.com, etc.)
- ✅ Authelia SSO authentication integration
- ✅ TCP routing support (MQTT broker on :8883)
- ✅ Load balancing across ~23 HTTP services + 1 TCP service
- ✅ Domain Manager API-driven (no Docker labels, no manual YAML)
- ✅ Drift-proof architecture (single source of truth: PostgreSQL)
Route Management:
PostgreSQL domains table
↓
Domain Manager API (/api/domains/*)
↓
traefik_generator.py (v2.3+)
↓
/traefik/config/generated/routes.yml (atomic writes)
↓
Traefik routers (HTTP + TCP)
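The atomic-write step in the pipeline above can be sketched as follows. This is a minimal illustration, not the code in traefik_generator.py: the function name, the lock file name, and the default file mode are assumptions. It writes to a temporary file in the same directory, then renames it over the target while holding an exclusive lock, and refuses empty content.

```python
import fcntl
import os
import tempfile


def write_routes_atomically(path: str, content: str, mode: int = 0o640) -> None:
    """Replace `path` with `content` in one step, or leave it untouched."""
    # Empty-file protection: never replace a working routes file with nothing.
    if not content.strip():
        raise ValueError("refusing to write an empty routes file")

    directory = os.path.dirname(os.path.abspath(path))
    lock_path = os.path.join(directory, ".routes.lock")  # hypothetical lock file name

    with open(lock_path, "w") as lock:
        # Exclusive advisory lock so only one writer runs at a time.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            # The temp file lives in the target directory so the final
            # rename stays on one filesystem and is atomic.
            fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
            try:
                with os.fdopen(fd, "w") as tmp:
                    tmp.write(content)
                    tmp.flush()
                    os.fsync(tmp.fileno())
                os.chmod(tmp_path, mode)  # mkstemp creates the file as 0600
                os.replace(tmp_path, path)
            except BaseException:
                if os.path.exists(tmp_path):
                    os.unlink(tmp_path)
                raise
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

A reader of the routes file sees either the old file or the new one, never a partially written file.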
HTTP Routes (Sample):
auth.charliehub.net → Authelia SSO (9091)
docs.charliehub.net → MkDocs (8000)
grafana.charliehub.net → Grafana (internal, via WireGuard)
prometheus.charliehub.net → Prometheus (internal, via WireGuard)
unifi.charliehub.net → UniFi API (8002)
*.trevarn.com → Customer portals
+ ~17 additional routes
TCP Routes:
mqtt.verdegris.eu:8883 → MQTT Broker (parking-mosquitto:1883)
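As a rough sketch of how rows like the samples above might become Traefik dynamic configuration, the function below builds the `http` and `tcp` sections from a list of dictionaries. The row field names follow the columns mentioned in this document; the entry point names (`websecure`, `mqtt`), the router naming scheme, and the `letsencrypt` resolver name are assumptions, and the real generator may differ.

```python
import json


def build_dynamic_config(rows):
    """Turn domain rows into a Traefik dynamic-configuration dictionary."""
    config = {
        "http": {"routers": {}, "services": {}},
        "tcp": {"routers": {}, "services": {}},
    }
    for row in rows:
        name = row["domain"].replace(".", "-")  # hypothetical naming scheme
        backend = f'{row["backend_host"]}:{row["backend_port"]}'
        if row.get("protocol") == "tcp":
            # TCP routers match on the TLS SNI name rather than the Host header.
            config["tcp"]["routers"][name] = {
                "entryPoints": [row["tcp_entrypoint"]],
                "rule": f'HostSNI(`{row["domain"]}`)',
                "service": name,
                "tls": {"certResolver": "letsencrypt"},
            }
            config["tcp"]["services"][name] = {
                "loadBalancer": {"servers": [{"address": backend}]}
            }
        else:
            config["http"]["routers"][name] = {
                "entryPoints": ["websecure"],
                "rule": f'Host(`{row["domain"]}`)',
                "service": name,
                "tls": {"certResolver": "letsencrypt"},
            }
            config["http"]["services"][name] = {
                "loadBalancer": {"servers": [{"url": f"http://{backend}"}]}
            }
    return config


if __name__ == "__main__":
    sample_rows = [
        {"domain": "auth.charliehub.net", "protocol": "http",
         "backend_host": "charliehub_authelia", "backend_port": 9091},
        {"domain": "mqtt.verdegris.eu", "protocol": "tcp", "tcp_entrypoint": "mqtt",
         "backend_host": "parking-mosquitto", "backend_port": 1883},
    ]
    # Serialized as JSON here to avoid a third-party YAML dependency.
    print(json.dumps(build_dynamic_config(sample_rows), indent=2))
```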
Monitoring: Prometheus metrics at 127.0.0.1:8091/metrics (internal only)
Key Features:
- File provider (watches /config recursively)
- Docker provider disabled (routes managed via API only)
- Atomic writes with fcntl locking
- Snapshot system with auto-pruning (100 snapshots or 30 days)
- Empty-file protection
- YAML validation
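The snapshot pruning rule in the list above could look roughly like the sketch below. It assumes one reading of "100 snapshots or 30 days": a snapshot is pruned if it is older than 30 days or falls outside the newest 100. The function name and the `(name, created_at)` tuple shape are illustrative only.

```python
from datetime import datetime, timedelta


def select_snapshots_to_prune(snapshots, now, max_count=100, max_age_days=30):
    """Return names of snapshots to delete, newest snapshots considered first.

    `snapshots` is a list of (name, created_at) tuples.
    """
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(snapshots, key=lambda item: item[1], reverse=True)
    to_prune = []
    for index, (name, created_at) in enumerate(newest_first):
        # Prune anything beyond the count limit or older than the age limit.
        if index >= max_count or created_at < cutoff:
            to_prune.append(name)
    return to_prune
```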
2. Domain Manager Control Plane
Location: hub2 (container)
Container: charliehub_domain_manager_v3
Port: 8001 (internal API)
Database: PostgreSQL 16 (charliehub_domains)
Responsibility:
- PostgreSQL domains table is the source of truth
- API validates all changes before persisting
- Generator (traefik_generator.py) converts database → YAML routes
- Enforces domain constraints (CHECK constraints in database)
- Manages DDNS updates via OVH API
API Endpoints:
GET /api/domains → List all domains
POST /api/domains → Create domain
PUT /api/domains/{id} → Update domain
DELETE /api/domains/{id} → Delete domain
GET /api/deploy-all → Force route generation
Database Constraints (Enforced):
- 11 CHECK constraints (backend coupling, port ranges, protocol rules)
- 7 ENUM types (service_type, environment, status, protocol, etc.)
- 2 Trigger functions (auto-timestamp updated_at, immutability)
- TCP routes REQUIRE: protocol='tcp', tcp_entrypoint, backend_host, backend_port
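To illustrate the TCP route requirement, here is a small validation sketch that mirrors the rule in application code. It is not the Domain Manager's validator: the function name is hypothetical, and the 1 to 65535 port range is an assumption about what the "port ranges" constraint checks.

```python
def validate_route(row):
    """Return a list of problems; an empty list means the row passes."""
    problems = []

    port = row.get("backend_port")
    # Assumed port range; the actual CHECK constraint may be narrower.
    if port is not None and not (1 <= port <= 65535):
        problems.append("backend_port must be between 1 and 65535")

    if row.get("protocol") == "tcp":
        # TCP routes need an entry point and a complete backend address.
        for field in ("tcp_entrypoint", "backend_host", "backend_port"):
            if not row.get(field):
                problems.append(f"tcp route requires {field}")
    return problems
```

Checking in the API as well as in the database lets a client receive a readable error message, while the CHECK constraints remain the final guard.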
3. WireGuard Site-to-Site VPN
Hub2 Interfaces:
wg-uk → 10.44.x.x subnets (UK homelab)
└ Peer: UK UniFi UCG
└ Allowed IPs: 10.44.0.0/16
└ Keepalive: 25s (WAN stability)
wg-fr → 10.35.x.x subnets (France homelab)
└ Peer: France UniFi DNR/UCG
└ Allowed IPs: 10.35.0.0/16
└ Keepalive: 25s (WAN stability)
└ Allowed IPs: (client assigned)
Features:
- ✅ Site-to-site routing (direct access to homelab LAN)
- ✅ Key rotation via wg-failover service
- ✅ Failover monitor (wg-fr primary, wg-uk fallback)
- ✅ Auto-restart on connectivity loss
- ✅ Firewall rules persist across reboots (netfilter-persistent)
IP Resolution:
- hub2 → homelab: Direct routing via WireGuard (wg-uk/wg-fr)
- homelab → hub2: VPN tunnel via UCG endpoint
- homelab → homelab: Direct LAN (not through hub2)
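The hub2 → homelab routing decision follows from the Allowed IPs listed above. The sketch below reproduces that lookup with the standard `ipaddress` module; the function name is hypothetical, and on hub2 the kernel routing table makes this decision, not application code.

```python
import ipaddress

# Allowed IPs per interface, as listed for wg-uk and wg-fr above.
SITE_ROUTES = {
    "wg-uk": ipaddress.ip_network("10.44.0.0/16"),
    "wg-fr": ipaddress.ip_network("10.35.0.0/16"),
}


def interface_for(destination):
    """Return the WireGuard interface that carries traffic to `destination`, or None."""
    address = ipaddress.ip_address(destination)
    for interface, network in SITE_ROUTES.items():
        if address in network:
            return interface
    return None  # not a homelab address, so it does not use a site tunnel
```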
4. Security Hardening (Feb 2026)
File Permissions:
Python code: 444 (r--r--r--) - Read-only for all
Config files: 440 (r--r-----) - Read-only + docker group
Directories: 555 (r-xr-xr-x) - Traversable, not writable
.env files: 440 - Read-only, docker group accessible
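A permission audit against the table above might be sketched like this. The suffix-to-mode mapping (which files count as "config") and the function name are assumptions; the modes themselves come from the table.

```python
import os
import stat

# Modes from the table above; which suffixes map to which row is an assumption.
EXPECTED_FILE_MODES = {".py": 0o444, ".env": 0o440, ".yml": 0o440}
EXPECTED_DIR_MODE = 0o555


def find_permission_drift(root):
    """Return sorted (path, actual, expected) tuples for entries whose mode differs."""
    drift = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode != EXPECTED_DIR_MODE:
                drift.append((path, f"{mode:03o}", f"{EXPECTED_DIR_MODE:03o}"))
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            for suffix, expected in EXPECTED_FILE_MODES.items():
                if name.endswith(suffix):
                    mode = stat.S_IMODE(os.stat(path).st_mode)
                    if mode != expected:
                        drift.append((path, f"{mode:03o}", f"{expected:03o}"))
                    break
    return sorted(drift)
```

Data directories such as the Prometheus and Redis volumes are intentionally writable, so a real audit would need to exclude them.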
Firewall (iptables):
✓ PostgreSQL (5432): Blocked on localhost
✓ Redis (6379): Blocked on localhost
✓ Traefik dashboard (8091): Port forwarding only via docker-proxy
✓ Port forwarding: Defined in NAT chain, persisted to /etc/iptables/rules.v4
Pre-commit Hooks:
✓ Prevents API key commits (scans for patterns)
✓ Blocks password commits (regex matching)
✓ Prevents token leaks (JWT patterns)
✓ Logged to: /var/log/charliehub/git-hooks.log
✓ Bypass prohibited by policy (operators doc forbids --no-verify; git itself does not enforce this)
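A pattern scan of the kind these hooks perform could be sketched as below. The three regular expressions are illustrative assumptions, not the patterns the deployed hook uses, and a real hook would scan the staged diff and not an arbitrary string.

```python
import re

# Hypothetical patterns; the deployed hook's rules are not shown in this document.
PATTERNS = {
    # JWTs are three base64url segments; an encoded JSON header starts with "eyJ".
    "jwt": re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "password": re.compile(r"(?i)password\s*[:=]\s*\S+"),
    "api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"]?[A-Za-z0-9]{16,}"),
}


def scan_text(text):
    """Return (line_number, pattern_label) for every suspicious line."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings
```

A hook built on this would exit non-zero when `scan_text` returns any findings, which is what makes git abort the commit.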
Secret Rotation (Quarterly):
Schedule: Last Sunday of each quarter @ 02:00 UTC
Next Runs: March 29, June 28, Sept 27, Dec 27, 2026
Script: /opt/charliehub/scripts/rotate-secrets.sh
Phases: Internal → Config → External
Backups: Auto-created, auto-rollback on failure
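The "last Sunday of each quarter" schedule can be checked with a few lines of date arithmetic. The function name is illustrative; the sketch simply steps back from the final day of the quarter's last month to the nearest Sunday.

```python
import calendar
from datetime import date, timedelta


def last_sunday_of_quarter(year, quarter):
    """Return the date of the last Sunday in the given quarter (1-4)."""
    month = quarter * 3  # quarters end in March, June, September, December
    last_day = calendar.monthrange(year, month)[1]
    end = date(year, month, last_day)
    # date.weekday(): Monday is 0 and Sunday is 6.
    days_back = (end.weekday() - 6) % 7
    return end - timedelta(days=days_back)


if __name__ == "__main__":
    for quarter in range(1, 5):
        print(last_sunday_of_quarter(2026, quarter).isoformat())
```

For 2026 this yields March 29, June 28, September 27, and December 27, matching the listed run dates.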
Audit Trail:
sudo commands: sudo journalctl _COMM=sudo
File changes: auditd rules (configured)
Code changes: git log (author tracked)
Deployment log: /var/log/charliehub/changes.log
Full report: charliehub-audit-report command
5. Operational Services
Container Layer:
charliehub-traefik Traefik v3 (routing)
charliehub-postgres PostgreSQL 16 (databases)
charliehub_authelia_redis Redis 7 (session storage)
charliehub_authelia Authelia (SSO/2FA)
charliehub_prometheus Prometheus (metrics)
charliehub_grafana Grafana (dashboards)
charliehub_domain_manager_v3 Domain Manager (DNS management)
charliehub_filebrowser FileBrowser (file management)
Network Isolation:
Traefik (charliehub-traefik):
→ Can reach: PostgreSQL, Redis, all services (via internal network)
→ Cannot reach: Host network directly
→ Binding: Docker bridge NAT
All containers:
→ Inside: 172.x.x.x internal subnet (docker-managed)
→ Blocked: Direct access to 127.0.0.1:5432 (PostgreSQL)
Direct access to 127.0.0.1:6379 (Redis)
Data Persistence:
/opt/charliehub/postgres/data PostgreSQL (volumes)
/opt/charliehub/authelia/redis-data Redis (volumes)
/opt/charliehub/prometheus/data Prometheus (volumes)
/opt/charliehub/monitoring/dashboards Grafana (volumes)
/opt/charliehub/domain-manager/data Domain DB (volumes)
6. Network Flow Diagram
Public User Access:
User (Internet)
│ HTTPS :443
▼ 51.68.235.106 (hub2 public IP)
├─ docker-proxy:443 (iptables NAT)
│ ├─ charliehub-traefik:443 (bridge network)
│ └─ Route via Domain Manager → auth, docs, api, etc.
│
└─ Response: ACME certificate (*.charliehub.net)
└─ via Authelia SSO authentication
Internal User Access (via WireGuard):
WireGuard Client (10.44.x.x or 10.35.x.x)
│ HTTPS :443 to 51.68.235.106 (hub2 public IP via WireGuard tunnel)
│ OR
│ Route directly via firewall rules (if configured)
│
▼ Traefik (unified, single instance)
├─ TLS termination
├─ Authelia SSO (if auth_required=true)
└─ Route to: grafana, prometheus, docs, unifi, domain-mgr, filebrowser, etc.
│
└─ Services behind Traefik:
├─ PostgreSQL (charliehub-postgres)
├─ Authelia (charliehub_authelia)
├─ Prometheus (charliehub_prometheus)
└─ Other containers on charliehub-internal network
TCP Routing (MQTT):
LoRaWAN Gateway (external)
│ MQTT over TLS :8883
▼ mqtt.verdegris.eu:8883
├─ Traefik TCP entrypoint (MQTT)
│
└─ parking-mosquitto:1883 (MQTT broker)
└─ Connected: 3 LoRaWAN gateways (UK + FR + US)
Resilience & High Availability
Firewall Rules Persistence (Feb 2026)
Problem Solved: Iptables rules were lost on reboot (kernel memory-only)
Solution Deployed: netfilter-persistent + iptables-save
Persistent Rules:
# View all rules
sudo iptables-save | less
# Modify and persist
sudo iptables [modification]
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null
# Auto-restore on boot
systemctl status netfilter-persistent # enabled
Rules Managed:
- RAW: Security rules (block direct access to services on localhost)
- NAT: Docker port forwarding (automatic via docker-proxy)
Data Directory Permissions (Feb 2026)
Lesson: Service data dirs need write access even in hardened environments
Current Permissions:
/prometheus/data 775 (read-write for container)
/authelia/redis-data 775 (read-write for container)
/postgres/data 755 (read for host, write for container)
Python source code 444 (read-only, not modifiable)
Configuration files 440 (read-only, not modifiable)
Principle:
- Code: Read-only (prevent accidental/malicious modification)
- Data: Read-write (services must function)
- Backups: Timestamped snapshots (audit trail)
Deployment Timeline
| Date | Milestone | Status |
|---|---|---|
| 2026-02-12 | WG-Easy decommissioned, Traefik unified | ✅ |
| 2026-02-12 | TCP routing unification (MQTT migrated) | ✅ |
| 2026-02-12 | Domain Manager control-plane hardening | ✅ |
| 2026-02-11 | OVH credentials restored, ACME operational | ✅ |
| 2026-02-11 | Iptables persistence deployed | ✅ |
| 2026-02-10 | WAN-Watcher DDNS integration complete | ✅ |
| 2026-02-09 | Secret rotation & pre-commit hooks deployed | ✅ |
| 2026-02-08 | Agent-proof infrastructure protection | ✅ |
| 2026-01-19 | hub2 deployed (OVH) | ✅ |
Key Infrastructure Decisions
Decision: Single Traefik + ACME-Only Model (Feb 2026)
Issue: Dual-Traefik model (public + internal CA) created architectural complexity and TLS model collision
Decision Made: Consolidate to single unified ACME-only instance
- Eliminated internal CA complexity
- Unified certificate management (Let's Encrypt only)
- Moved 10 internal services to file-based routes with IP allowlist middleware
- Original domains preserved (*.charliehub.net, zero renames)
Why This Matters:
- Simpler to operate and maintain
- Middleware fails closed (HTTP 403), making access control deterministic
- Drift-proof architecture with single source of truth (PostgreSQL)
- Extensible for future services (SSH, PostgreSQL proxies, custom TCP)
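The fail-closed behaviour mentioned above can be illustrated with a small sketch. It mimics the described outcome and is not Traefik's middleware code; the allowed ranges are assumed to be the two WireGuard site subnets, and the function name is hypothetical.

```python
import ipaddress

# Assumed allowlist: the two homelab subnets reachable over WireGuard.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.44.0.0/16"),
    ipaddress.ip_network("10.35.0.0/16"),
]


def allowlist_status(remote_addr):
    """Return 200 for an allowed source address and 403 for everything else."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Fail closed: an address that cannot be parsed is treated as denied.
        return 403
    if any(address in network for network in ALLOWED_NETWORKS):
        return 200
    return 403
```

Because every path that is not an explicit match returns 403, a misconfigured or missing client address can never widen access.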
Monitoring & Observability
Prometheus Scrape Jobs
Current Jobs:
- job_name: "hub2-node"
static_configs:
- targets: ["127.0.0.1:9100"] # Node exporter
labels:
service: "hub2-infrastructure"
- job_name: "traefik"
static_configs:
- targets: ["127.0.0.1:8091"] # Traefik metrics
labels:
service: "unified-edge"
- job_name: "charliehub-postgres"
static_configs:
- targets: ["172.19.0.5:9187"] # Postgres exporter
labels:
service: "database"
Alerting Rules
Key alerts configured in /opt/charliehub/monitoring/prometheus/rules/:
- Traefik backend down
- PostgreSQL connection failures
- Redis memory pressure
- WireGuard connectivity
- Disk space warnings
- Certificate expiration
- MQTT gateway connectivity
Troubleshooting Quick Reference
Service Health
# Overall system health
curl -s 'http://localhost:9090/api/v1/query?query=up'
# Traefik metrics
curl http://127.0.0.1:8091/metrics | grep "traefik_"
# Domain Manager status
curl http://localhost:8001/api/domains | jq 'length'
# Certificate status
echo | openssl s_client -connect auth.charliehub.net:443 -servername auth.charliehub.net 2>/dev/null | openssl x509 -noout -subject -dates
# MQTT connectivity
docker logs parking-mosquitto --tail 10 | grep "connected"
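The `up` query above returns JSON in the Prometheus instant-query format, which is easier to act on once parsed. The helper below is a sketch with a hypothetical name; it takes the response body as a string so it can be used with any HTTP client.

```python
import json


def down_targets(response_body):
    """Return (job, instance) pairs whose `up` sample is not 1."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in payload["data"]["result"]:
        # Each instant-vector sample carries [timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        if value != "1":
            metric = sample["metric"]
            down.append((metric.get("job"), metric.get("instance")))
    return down
```

An empty list means every scraped target reported as up at query time.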
Common Issues
Service unreachable:
1. Check WireGuard connection: sudo wg show
2. Verify domain exists: docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains -c "SELECT domain, status FROM domains WHERE domain='myservice.charliehub.net'"
3. Check Traefik logs: docker logs charliehub-traefik --tail 20
4. Verify routing: curl http://localhost:8091/api/http/routers | jq '.[] | select(.name | contains("myservice"))'
Certificate errors:
1. Verify ACME status: cat /opt/charliehub/traefik/certs/acme.json | jq '.letsencrypt.Certificates[].domain'
2. Check Traefik logs: docker logs charliehub-traefik 2>&1 | grep -i acme
3. Validate DNS: dig auth.charliehub.net @1.1.1.1
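For the first step above, the same information as the jq query can be pulled out in Python, which makes it easy to compare against an expected domain list. This sketch assumes the acme.json layout implied by that query (a `letsencrypt` resolver key holding a `Certificates` list whose entries have a `domain` object with `main` and optional `sans`); the function name is hypothetical.

```python
import json


def certificate_domains(acme_json_text, resolver="letsencrypt"):
    """Return every domain name covered by the stored certificates."""
    data = json.loads(acme_json_text)
    # "Certificates" can be null before the first certificate is issued.
    certificates = data.get(resolver, {}).get("Certificates") or []
    names = []
    for entry in certificates:
        domain = entry.get("domain", {})
        names.append(domain.get("main"))
        names.extend(domain.get("sans") or [])
    return [name for name in names if name]
```

A domain that is active in the database but absent from this list is a likely candidate for the ACME errors in step 2.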
Routes not generating:
1. Manually trigger generation: docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py
2. Check for errors in output
3. Verify database consistency: docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains -c "SELECT COUNT(*) FROM domains WHERE status='active'"
Future Improvements
Short-Term (Next 30 days)
- [ ] Implement CT1119 as dedicated VPN control plane
- [ ] Document all TCP routing examples
- [ ] Add SLA/SLO monitoring for public edge
Medium-Term (Next 90 days)
- [ ] Sprint 5 migration: public surface reduction (move admin domains to WireGuard tier)
- [ ] Implement secrets vault (instead of .env files)
- [ ] Add API rate-limiting per consumer
Long-Term (6+ months)
- [ ] Evaluate service mesh (Istio/Linkerd) for advanced traffic management
- [ ] Separate public/internal networks at infrastructure level
- [ ] Implement multi-region failover
See Also
- Cluster Overview - Infrastructure topology
- Access Setup - Connecting to services
- Services Index - Detailed service documentation
- Traefik Routing - Traefik configuration & management
- Operator How-To - Making changes safely
- Security Maintenance - Hardening details
- Standards & Governance - Control plane rules