CharlieHub Infrastructure Architecture (2026)

Last Updated: February 12, 2026
Status: Production - Fully Hardened
Architecture: Unified Single-Traefik ACME-Only Model

Executive Summary

CharlieHub is a distributed infrastructure spanning three geographic locations (UK homelab, France homelab, OVH datacenter) with a central hub (hub2) managing public-facing services, internal operations, and site-to-site VPN connectivity. The architecture emphasizes security hardening, high availability, drift-proof control planes, and unified routing.


Architecture Diagram

                                  ┌─────────────────────────────┐
                                  │  Internet / Public Users    │
                                  └──────────────┬──────────────┘
                                                 │ HTTPS :443
                                                 ▼
                                   ╔═════════════════════════════╗
                                   ║   HUB2 (OVH Dedicated)      ║
                                   ║    51.68.235.106            ║
                                   ╚═════════════════════════════╝
                                                 │
                        ┌────────────────────────┼────────────────────────┐
                        │                        │                        │
                        ▼                        ▼                        ▼
            ╔══════════════════════╗  ╔═══════════════════╗  ╔══════════════════╗
            ║  UNIFIED TRAEFIK     ║  │ CONTROL PLANE     │  │  OPERATIONAL     │
            ║  (Single Instance)   ║  │ (PostgreSQL)      │  │  SERVICES        │
            ║  0.0.0.0:80,443      ║  │ Domain Manager    │  │                  │
            ║  0.0.0.0:8883 (MQTT) ║  │ API               │  │ Postgres         │
            ║  8091 (dashboard)    ║  │                   │  │ Redis            │
            ║                      ║  │ Route Generation  │  │ Prometheus       │
            ║ ACME certs (~23)     ║  │ (traefik_gen.py)  │  │ Authelia         │
            ║ ~23 HTTP routes      ║  │                   │  │ Domain Mgr       │
            ║  1 TCP route (MQTT)  ║  │ Drift-proof       │  │ FileBrowser      │
            ║                      ║  │ validation        │  │ UniFi API        │
            ║ Domain Manager API   ║  │                   │  │                  │
            ║ driven routing       ║  │ /api/domains/*    │  │ (Container       │
            ║                      ║  │ + validation      │  │  Protected)      │
            ║ Authelia SSO         ║  │                   │  │                  │
            ╚══════════════════════╝  ╚═══════════════════╝  ╚══════════════════╝
                        │                        │                        │
                        └────────────────────────┼────────────────────────┘
                                                 │
                                      ┌──────────┴──────────┐
                                      │                     │
                                      ▼ WireGuard VPN      ▼ WireGuard VPN
                          ┌──────────────────────┐   ┌─────────────────────┐
                          │ UK Site (10.44.x.x)  │   │ FR Site (10.35.x.x) │
                          │ px1, px2, px3        │   │ px5-lemans (DR)     │
                          │ Proxmox Cluster      │   │                     │
                          │ Ceph Storage (3x)    │   │ QDevice Quorum      │
                          └──────────────────────┘   └─────────────────────┘

Core Components

1. Unified Traefik (Single Instance - ACME Only)

Location: hub2 (OVH Datacenter)
Instances: 1 (consolidated from previous dual-instance model)
Container: charliehub-traefik (Traefik v3)
Ports: 80 (HTTP redirect), 443 (HTTPS), 8883 (MQTT TCP), 8091 (dashboard)
Certificates: ACME production (Let's Encrypt) - ~23 active certificates
Architecture: Single-writer (drift-proof), file-based routes

Characteristics:
- ✅ ACME certificate management (auto-renewal)
- ✅ Public DNS routing (*.charliehub.net, *.trevarn.com, etc.)
- ✅ Authelia SSO authentication integration
- ✅ TCP routing support (MQTT broker on :8883)
- ✅ Load balancing across ~23 HTTP services + 1 TCP service
- ✅ Domain Manager API-driven (no Docker labels, no manual YAML)
- ✅ Drift-proof architecture (single source of truth: PostgreSQL)

Route Management:

PostgreSQL domains table
         ↓
Domain Manager API (/api/domains/*)
         ↓
traefik_generator.py (v2.3+)
         ↓
/traefik/config/generated/routes.yml (atomic writes)
         ↓
Traefik routers (HTTP + TCP)
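
The database → YAML step in the pipeline above can be sketched as follows. This is a minimal illustration, not the actual traefik_generator.py: the function name build_routes, the entry point names "websecure" and "mqtt", and the router naming scheme are assumptions. The column names (domain, status, protocol, tcp_entrypoint, backend_host, backend_port) and the "letsencrypt" resolver name are taken from elsewhere in this document. The output is printed as JSON, which is also valid YAML; a real generator would more likely use a YAML library.

```python
import json


def build_routes(domains, cert_resolver="letsencrypt"):
    """Convert domain rows (dicts) into a Traefik dynamic-config mapping."""
    config = {
        "http": {"routers": {}, "services": {}},
        "tcp": {"routers": {}, "services": {}},
    }
    for row in domains:
        if row.get("status") != "active":
            continue  # only active rows become routes
        name = row["domain"].replace(".", "-")  # hypothetical naming scheme
        backend = f'{row["backend_host"]}:{row["backend_port"]}'
        if row.get("protocol") == "tcp":
            # TCP routers match on the TLS SNI name and forward raw bytes
            config["tcp"]["routers"][name] = {
                "rule": f'HostSNI(`{row["domain"]}`)',
                "entryPoints": [row["tcp_entrypoint"]],
                "service": name,
                "tls": {"certResolver": cert_resolver},
            }
            config["tcp"]["services"][name] = {
                "loadBalancer": {"servers": [{"address": backend}]}
            }
        else:
            config["http"]["routers"][name] = {
                "rule": f'Host(`{row["domain"]}`)',
                "entryPoints": ["websecure"],  # assumed entry point name
                "service": name,
                "tls": {"certResolver": cert_resolver},
            }
            config["http"]["services"][name] = {
                "loadBalancer": {"servers": [{"url": f"http://{backend}"}]}
            }
    return config


# Example rows modelled on routes listed in this document
rows = [
    {"domain": "auth.charliehub.net", "status": "active", "protocol": "http",
     "backend_host": "charliehub_authelia", "backend_port": 9091},
    {"domain": "mqtt.verdegris.eu", "status": "active", "protocol": "tcp",
     "tcp_entrypoint": "mqtt", "backend_host": "parking-mosquitto",
     "backend_port": 1883},
    {"domain": "old.charliehub.net", "status": "disabled", "protocol": "http",
     "backend_host": "127.0.0.1", "backend_port": 8000},
]
print(json.dumps(build_routes(rows), indent=2))
```

Because the mapping is rebuilt from the database on every run, a manual edit to routes.yml is overwritten at the next generation, which is what keeps the file from drifting.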

HTTP Routes (Sample):

auth.charliehub.net             → Authelia SSO (9091)
docs.charliehub.net             → MkDocs (8000)
grafana.charliehub.net          → Grafana (internal, via WireGuard)
prometheus.charliehub.net       → Prometheus (internal, via WireGuard)
unifi.charliehub.net            → UniFi API (8002)
*.trevarn.com                   → Customer portals
+ ~17 additional routes

TCP Routes:

mqtt.verdegris.eu:8883          → MQTT Broker (parking-mosquitto:1883)

Monitoring: Prometheus metrics at 127.0.0.1:8091/metrics (internal only)

Key Features:
- File provider (watches /config recursively)
- Docker provider disabled (routes managed via API only)
- Atomic writes with fcntl locking
- Snapshot system with auto-pruning (100 snapshots or 30 days)
- Empty-file protection
- YAML validation
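
The "atomic writes with fcntl locking" and "empty-file protection" items can be illustrated with a short sketch. The function name write_routes_atomically and the ".lock"/".tmp" file suffixes are hypothetical, and YAML validation is left out because it needs a third-party parser. The real generator may differ in detail.

```python
import fcntl
import os
import tempfile


def write_routes_atomically(path, content):
    """Write content to path so readers never see a partial or empty file."""
    if not content.strip():
        # Empty-file protection: keep the previous routes instead
        raise ValueError("refusing to write an empty routes file")
    directory = os.path.dirname(os.path.abspath(path))
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # one writer at a time
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as tmp:
                tmp.write(content)
                tmp.flush()
                os.fsync(tmp.fileno())  # data on disk before the rename
            os.replace(tmp_path, path)  # atomic on the same filesystem
        finally:
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)  # clean up after a failed write
    # The lock is released when the lock file is closed


# Usage against a throwaway directory
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "routes.yml")
    write_routes_atomically(target, "http: {}\n")
    with open(target) as handle:
        print(handle.read())
```

Writing to a temporary file in the same directory and renaming it means a watcher sees either the old file or the complete new one, never a half-written file.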


2. Domain Manager Control Plane

Location: hub2 (container)
Container: charliehub_domain_manager_v3
Port: 8001 (internal API)
Database: PostgreSQL 16 (charliehub_domains)

Responsibility:
- PostgreSQL domains table is the source of truth
- API validates all changes before persisting
- Generator (traefik_generator.py) converts database → YAML routes
- Enforces domain constraints (CHECK constraints in database)
- Manages DDNS updates via OVH API

API Endpoints:

GET  /api/domains               → List all domains
POST /api/domains               → Create domain
PUT  /api/domains/{id}          → Update domain
DELETE /api/domains/{id}        → Delete domain
GET  /api/deploy-all            → Force route generation
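
A client for these endpoints might look like the sketch below, using only the standard library. The base URL uses the port 8001 stated above. The payload fields, the lack of authentication headers, and the helper names build_request and call are assumptions; check the real API schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # Domain Manager internal API port


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request for the Domain Manager API."""
    data = json.dumps(payload).encode() if payload is not None else None
    request = urllib.request.Request(BASE_URL + path, data=data, method=method)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def call(request, timeout=10):
    """Send a request and decode the JSON response body, if any."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
        return json.loads(body) if body else None


# Hypothetical payload; field names follow the columns named in this document
create = build_request("POST", "/api/domains", {
    "domain": "myservice.charliehub.net",
    "protocol": "http",
    "backend_host": "myservice",
    "backend_port": 8080,
})
regenerate = build_request("GET", "/api/deploy-all")
print(create.get_method(), create.full_url)
print(regenerate.get_method(), regenerate.full_url)
# On hub2: call(create), then call(regenerate) to force route generation
```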

Database Constraints (Enforced):
- 11 CHECK constraints (backend coupling, port ranges, protocol rules)
- 7 ENUM types (service_type, environment, status, protocol, etc.)
- 2 Trigger functions (auto-timestamp updated_at, immutability)
- TCP routes REQUIRE: protocol='tcp', tcp_entrypoint, backend_host, backend_port
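
The TCP requirement above can be mirrored in application code so that invalid rows are rejected before they reach the database. The sketch below is illustrative only: the function name validate_domain_row is hypothetical, and the 1..65535 port range is an assumption, since the document does not list the exact bounds of its CHECK constraints.

```python
def validate_domain_row(row):
    """Return a list of validation errors for a domain row (empty if valid)."""
    errors = []
    port = row.get("backend_port")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        errors.append("backend_port must be an integer in 1..65535")
    if row.get("protocol") == "tcp":
        # Mirrors: TCP routes REQUIRE tcp_entrypoint, backend_host, backend_port
        for field in ("tcp_entrypoint", "backend_host", "backend_port"):
            if not row.get(field):
                errors.append(f"TCP route requires {field}")
    return errors


mqtt = {"domain": "mqtt.verdegris.eu", "protocol": "tcp",
        "tcp_entrypoint": "mqtt", "backend_host": "parking-mosquitto",
        "backend_port": 1883}
broken = {"domain": "mqtt.verdegris.eu", "protocol": "tcp",
          "backend_host": "parking-mosquitto", "backend_port": 1883}
print(validate_domain_row(mqtt))
print(validate_domain_row(broken))
```

Keeping the database constraints as the final authority means a bug in the API-side check cannot introduce an invalid route; the application check only gives earlier, friendlier errors.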


3. WireGuard Site-to-Site VPN

Hub2 Interfaces:

wg-uk   → 10.44.x.x subnets (UK homelab)
         └ Peer: UK UniFi UCG
         └ Allowed IPs: 10.44.0.0/16
         └ Keepalive: 25s (WAN stability)

wg-fr   → 10.35.x.x subnets (France homelab)
         └ Peer: France UniFi DNR/UCG
         └ Allowed IPs: 10.35.0.0/16
         └ Keepalive: 25s (WAN stability)

         └ Allowed IPs: (client assigned)

Features:
- ✅ Site-to-site routing (direct access to homelab LAN)
- ✅ Key rotation via wg-failover service
- ✅ Failover monitor (wg-fr primary, wg-uk fallback)
- ✅ Auto-restart on connectivity loss
- ✅ Firewall rules persist across reboots (netfilter-persistent)
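
The failover decision (wg-fr primary, wg-uk fallback) can be sketched as a pure function over handshake timestamps. The command `wg show <interface> latest-handshakes` prints one line per peer containing the public key and the epoch time of the last handshake. How the real wg-failover service decides is not documented here, so the 180-second staleness threshold and the function names are assumptions.

```python
import time


def newest_handshake(wg_output):
    """Parse `wg show <iface> latest-handshakes` output; return newest epoch or 0."""
    stamps = []
    for line in wg_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stamps.append(int(parts[1]))
    return max(stamps, default=0)


def choose_tunnel(handshakes, now=None, max_age=180, order=("wg-fr", "wg-uk")):
    """Return the first tunnel in priority order with a fresh handshake."""
    now = time.time() if now is None else now
    for name in order:
        stamp = handshakes.get(name, 0)
        if stamp and now - stamp <= max_age:
            return name
    return None  # no healthy tunnel: the caller should restart or alert


# Example with made-up output: wg-fr is stale, wg-uk is fresh
sample_fr = "FAKEPUBLICKEYFR=\t1700000000\n"
sample_uk = "FAKEPUBLICKEYUK=\t1700000950\n"
state = {"wg-fr": newest_handshake(sample_fr), "wg-uk": newest_handshake(sample_uk)}
print(choose_tunnel(state, now=1700001000))
```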

IP Resolution:
- hub2 → homelab: Direct routing via WireGuard (wg-uk/wg-fr)
- homelab → hub2: VPN tunnel via UCG endpoint
- homelab → homelab: Direct LAN (not through hub2)
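
The hub2 → homelab rule follows directly from the Allowed IPs ranges listed above: a destination address is sent through whichever interface's range contains it. A minimal sketch with the standard ipaddress module (the function name interface_for is hypothetical):

```python
import ipaddress

# Allowed IPs per interface, as listed under "Hub2 Interfaces"
ALLOWED_IPS = {
    "wg-uk": ipaddress.ip_network("10.44.0.0/16"),
    "wg-fr": ipaddress.ip_network("10.35.0.0/16"),
}


def interface_for(address):
    """Return the WireGuard interface that carries traffic to address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in ALLOWED_IPS.items():
        if ip in network:
            return name
    return None  # not a homelab address: uses the default route


for destination in ("10.44.1.20", "10.35.0.7", "192.168.1.1"):
    print(destination, "->", interface_for(destination))
```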


4. Security Hardening (Feb 2026)

File Permissions:

Python code:    444 (r--r--r--)  - Read-only for all
Config files:   440 (r--r-----)  - Read-only + docker group
Directories:    555 (r-xr-xr-x)  - Traversable, not writable
.env files:     440              - Read-only, docker group accessible
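
A permissions audit for the table above could be sketched as follows. The helper names describe and find_violations and the suffix-to-mode mapping are illustrative assumptions; only the 444 rule for Python code is checked here.

```python
import os
import stat
import tempfile


def describe(mode):
    """Render permission bits the way the table above does, e.g. r--r--r--."""
    return stat.filemode(stat.S_IFREG | mode)[1:]


def find_violations(root, expected_for_suffix):
    """List files whose mode grants any bit beyond the expected mode."""
    violations = []
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            expected = expected_for_suffix.get(os.path.splitext(name)[1])
            if expected is None:
                continue  # no rule for this file type
            path = os.path.join(dirpath, name)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode & ~expected:
                violations.append((path, oct(mode)))
    return violations


print(describe(0o444), describe(0o440), describe(0o555))

# Usage against a throwaway directory
with tempfile.TemporaryDirectory() as root:
    for name, mode in (("ok.py", 0o444), ("writable.py", 0o644)):
        path = os.path.join(root, name)
        with open(path, "w") as handle:
            handle.write("# placeholder\n")
        os.chmod(path, mode)
    print(find_violations(root, {".py": 0o444}))
```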

Firewall (iptables):

✓ PostgreSQL (5432):        Blocked on localhost
✓ Redis (6379):             Blocked on localhost
✓ Traefik dashboard (8091): Port forwarding only via docker-proxy
✓ Port forwarding:          Defined in NAT chain, persisted to /etc/iptables/rules.v4

Pre-commit Hooks:

✓ Prevents API key commits    (scans for patterns)
✓ Blocks password commits     (regex matching)
✓ Prevents token leaks        (JWT patterns)
✓ Logged to:                  /var/log/charliehub/git-hooks.log
✓ Must not be bypassed (operators doc forbids --no-verify)
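
The pattern scanning performed by the hooks can be approximated with a few regular expressions. The patterns below are illustrative assumptions, not the rules actually deployed, and the sample secrets are fake.

```python
import re

# Hypothetical patterns for the three categories listed above
SECRET_PATTERNS = {
    "jwt": re.compile(
        r"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}"),
    "password": re.compile(r"(?i)\b(password|passwd|pwd)\s*[:=]\s*\S+"),
    "api_key": re.compile(
        r"(?i)\bapi[_-]?key\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{16,}"),
}


def scan(text):
    """Return (line number, pattern label) pairs for suspected secrets."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


staged = "\n".join([
    'token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.abcdefghijklmnop"',
    "PASSWORD=hunter2hunter2",
    "api_key: ABCDEFGHIJKLMNOP1234",
    "port = 8080",
])
for lineno, label in scan(staged):
    print(f"line {lineno}: possible {label}")
```

A hook built on this would exit non-zero when scan returns any finding, which is what blocks the commit.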

Secret Rotation (Quarterly):

Schedule:  Last Sunday of each quarter @ 02:00 UTC
Next Runs: March 29, June 28, Sept 27, Dec 27, 2026
Script:    /opt/charliehub/scripts/rotate-secrets.sh
Phases:    Internal → Config → External
Backups:   Auto-created, auto-rollback on failure
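
The listed run dates can be reproduced programmatically. The sketch assumes "last Sunday of each quarter" means the last Sunday of the quarter's final month (March, June, September, December), which matches the dates above. The function names are hypothetical.

```python
import calendar
from datetime import date, datetime, time, timedelta, timezone


def last_sunday(year, month):
    """Return the date of the last Sunday in the given month."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    # date.weekday(): Monday is 0 and Sunday is 6 (calendar.SUNDAY)
    offset = (last_day.weekday() - calendar.SUNDAY) % 7
    return last_day - timedelta(days=offset)


def rotation_runs(year):
    """Return the four quarterly rotation times (02:00 UTC) for a year."""
    return [
        datetime.combine(last_sunday(year, month), time(2, 0), tzinfo=timezone.utc)
        for month in (3, 6, 9, 12)
    ]


for run in rotation_runs(2026):
    print(run.isoformat())
```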

Audit Trail:

sudo commands:      sudo journalctl _COMM=sudo
File changes:       auditd rules (configured)
Code changes:       git log (author tracked)
Deployment log:     /var/log/charliehub/changes.log
Full report:        charliehub-audit-report command


5. Operational Services

Container Layer:

charliehub-traefik            Traefik v3 (routing)
charliehub-postgres           PostgreSQL 16 (databases)
charliehub_authelia_redis     Redis 7 (session storage)
charliehub_authelia           Authelia (SSO/2FA)
charliehub_prometheus         Prometheus (metrics)
charliehub_grafana            Grafana (dashboards)
charliehub_domain_manager_v3  Domain Manager (DNS management)
charliehub_filebrowser        FileBrowser (file management)

Network Isolation:

Traefik (charliehub-traefik):
  → Can reach: PostgreSQL, Redis, all services (via internal network)
  → Cannot reach: Host network directly
  → Binding: Docker bridge NAT

All containers:
  → Inside: 172.x.x.x internal subnet (docker-managed)
  → Blocked: Direct access to 127.0.0.1:5432 (PostgreSQL)
             Direct access to 127.0.0.1:6379 (Redis)

Data Persistence:

/opt/charliehub/postgres/data           PostgreSQL (volumes)
/opt/charliehub/authelia/redis-data     Redis (volumes)
/opt/charliehub/prometheus/data         Prometheus (volumes)
/opt/charliehub/monitoring/dashboards   Grafana (volumes)
/opt/charliehub/domain-manager/data     Domain DB (volumes)


6. Network Flow Diagram

Public User Access:

User (Internet)
  │ HTTPS :443
  ▼ 51.68.235.106 (hub2 public IP)
  ├─ docker-proxy:443 (iptables NAT)
  │ ├─ charliehub-traefik:443 (bridge network)
  │ └─ Route via Domain Manager → auth, docs, api, etc.
  │
  └─ Response: ACME certificate (*.charliehub.net)
     └─ via Authelia SSO authentication

Internal User Access (via WireGuard):

WireGuard Client (10.44.x.x or 10.35.x.x)
  │ HTTPS :443 to 51.68.235.106 (hub2 public IP via WireGuard tunnel)
  │ OR
  │ Route directly via firewall rules (if configured)
  │
  ▼ Traefik (unified, single instance)
  ├─ TLS termination
  ├─ Authelia SSO (if auth_required=true)
  └─ Route to: grafana, prometheus, docs, unifi, domain-mgr, filebrowser, etc.
     │
     └─ Services behind Traefik:
        ├─ PostgreSQL (charliehub-postgres)
        ├─ Authelia (charliehub_authelia)
        ├─ Prometheus (charliehub_prometheus)
        └─ Other containers on charliehub-internal network

TCP Routing (MQTT):

LoRaWAN Gateway (external)
  │ MQTT over TLS :8883
  ▼ mqtt.verdegris.eu:8883
  ├─ Traefik TCP entrypoint (MQTT)
  │
  └─ parking-mosquitto:1883 (MQTT broker)
     └─ Connected: 3 LoRaWAN gateways (UK + FR + US)


Resilience & High Availability

Firewall Rules Persistence (Feb 2026)

Problem Solved: iptables rules were lost on reboot (they live only in kernel memory)
Solution Deployed: netfilter-persistent + iptables-save

Persistent Rules:

# View all rules
sudo iptables-save | less

# Modify and persist
sudo iptables [modification]
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null

# Auto-restore on boot
systemctl status netfilter-persistent  # enabled

Rules Managed:
- RAW: Security rules (block direct access to services on localhost)
- NAT: Docker port forwarding (automatic via docker-proxy)

Data Directory Permissions (Feb 2026)

Lesson: Service data dirs need write access even in hardened environments

Current Permissions:

/prometheus/data          775 (read-write for container)
/authelia/redis-data      775 (read-write for container)
/postgres/data            755 (read for host, write for container)
Python source code        444 (read-only, not modifiable)
Configuration files       440 (read-only, not modifiable)

Principle:
- Code: Read-only (prevent accidental/malicious modification)
- Data: Read-write (services must function)
- Backups: Timestamped snapshots (audit trail)


Deployment Timeline

Date        Milestone
2026-02-12  WG-Easy decommissioned, Traefik unified
2026-02-12  TCP routing unification (MQTT migrated)
2026-02-12  Domain Manager control-plane hardening
2026-02-11  OVH credentials restored, ACME operational
2026-02-11  Iptables persistence deployed
2026-02-10  WAN-Watcher DDNS integration complete
2026-02-09  Secret rotation & pre-commit hooks deployed
2026-02-08  Agent-proof infrastructure protection
2026-01-19  hub2 deployed (OVH)

Key Infrastructure Decisions

Decision: Single Traefik + ACME-Only Model (Feb 2026)

Issue: Dual-Traefik model (public + internal CA) created architectural complexity and TLS model collision

Decision Made: Consolidate to a single unified ACME-only instance
- Eliminated internal CA complexity
- Unified certificate management (Let's Encrypt only)
- Moved 10 internal services to file-based routes with IP allowlist middleware
- Original domains preserved (*.charliehub.net, zero renames)

Why This Matters:
- Simpler to operate and maintain
- Middleware fails closed (HTTP 403), making access control deterministic
- Drift-proof architecture with single source of truth (PostgreSQL)
- Extensible for future services (SSH, PostgreSQL proxies, custom TCP)


Monitoring & Observability

Prometheus Scrape Jobs

Current Jobs:

- job_name: "hub2-node"
  static_configs:
    - targets: ["127.0.0.1:9100"]  # Node exporter
      labels:
        service: "hub2-infrastructure"

- job_name: "traefik"
  static_configs:
    - targets: ["127.0.0.1:8091"]  # Traefik metrics
      labels:
        service: "unified-edge"

- job_name: "charliehub-postgres"
  static_configs:
    - targets: ["172.19.0.5:9187"]  # Postgres exporter
      labels:
        service: "database"

Alerting Rules

Key alerts configured in /opt/charliehub/monitoring/prometheus/rules/:
- Traefik backend down
- PostgreSQL connection failures
- Redis memory pressure
- WireGuard connectivity
- Disk space warnings
- Certificate expiration
- MQTT gateway connectivity


Troubleshooting Quick Reference

Service Health

# Overall system health
curl -s 'http://localhost:9090/api/v1/query?query=up'

# Traefik metrics
curl http://127.0.0.1:8091/metrics | grep "traefik_"

# Domain Manager status
curl http://localhost:8001/api/domains | jq 'length'

# Certificate status (TLS details appear only in curl's verbose output, on stderr)
curl -vI https://auth.charliehub.net 2>&1 | grep -iE "subject:|issuer:|expire date:"

# MQTT connectivity (2>&1 because docker logs may emit container output on stderr)
docker logs parking-mosquitto --tail 10 2>&1 | grep "connected"

Common Issues

Service unreachable:
1. Check WireGuard connection: sudo wg show
2. Verify domain exists: docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains -c "SELECT domain, status FROM domains WHERE domain='myservice.charliehub.net'"
3. Check Traefik logs: docker logs charliehub-traefik --tail 20
4. Verify routing: curl http://localhost:8091/api/http/routers | jq '.[] | select(.name | contains("myservice"))'

Certificate errors:
1. Verify ACME status: cat /opt/charliehub/traefik/certs/acme.json | jq '.letsencrypt.Certificates[].domain'
2. Check Traefik logs: docker logs charliehub-traefik 2>&1 | grep -i acme
3. Validate DNS: dig auth.charliehub.net @1.1.1.1

Routes not generating:
1. Manually trigger generation: docker exec charliehub_domain_manager_v3 python3 /app/services/traefik_generator.py
2. Check for errors in output
3. Verify database consistency: docker exec -i charliehub-postgres psql -U charliehub -d charliehub_domains -c "SELECT COUNT(*) FROM domains WHERE status='active'"


Future Improvements

Short-Term (Next 30 days)

  • [ ] Implement CT1119 as dedicated VPN control plane
  • [ ] Document all TCP routing examples
  • [ ] Add SLA/SLO monitoring for public edge

Medium-Term (Next 90 days)

  • [ ] Sprint 5: Public surface reduction (move admin domains to WireGuard tier)
  • [ ] Implement secrets vault (instead of .env files)
  • [ ] Add API rate-limiting per consumer

Long-Term (6+ months)

  • [ ] Evaluate service mesh (Istio/Linkerd) for advanced traffic management
  • [ ] Separate public/internal networks at infrastructure level
  • [ ] Implement multi-region failover
