Troubleshooting¶

Common issues and solutions for the CharlieHub cluster.

Quick Diagnostics¶

# Cluster status
pvecm status

# All VMs/CTs status
infra list

# Check key services (all on hub2)
ssh hub2 sudo docker ps
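
The three checks above can be wrapped so that each one prints a single verdict line. This is a minimal sketch, not part of the cluster tooling: `run_check` is a hypothetical helper name, and it only reports whether the command exited successfully.

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and print a one-line verdict.
# Returns non-zero when the wrapped command fails.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK    %s\n' "$label"
    else
        printf 'FAIL  %s\n' "$label"
        return 1
    fi
}

# Example usage with the commands from this section:
#   run_check "cluster status"  pvecm status
#   run_check "guest list"      infra list
#   run_check "hub2 containers" ssh hub2 sudo docker ps
```

A `FAIL` line only says the command returned a non-zero exit code; rerun that command directly to see its output.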

Network Issues¶

Cannot Access Web Services¶

Symptoms: https://charliehub.net is not responding

Check Traefik:

ssh root@REDACTED_IP docker ps | grep traefik
ssh root@REDACTED_IP docker logs traefik_prod --tail 20

Restart Traefik:

ssh root@REDACTED_IP 'cd /opt/charliehub/traefik-prod && docker compose restart'

DNS Not Resolving¶

Check DDNS service:

curl http://REDACTED_IP:5000/health
ssh root@REDACTED_IP docker logs charliehub_ddns_v2 --tail 20

Check current public IP:

curl -s http://REDACTED_IP:5000/health | jq .current_ipv4
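
To tell whether DNS has caught up with the address the DDNS service reports, the two values can be compared. The sketch below assumes the `.current_ipv4` field shown above; `compare_ips`, the `DDNS_URL` variable, and the use of `dig` for the lookup are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: compare the IPv4 address reported by the DDNS
# service with the address that DNS currently returns.
compare_ips() {
    reported=$1
    resolved=$2
    if [ -z "$reported" ] || [ "$reported" = "null" ]; then
        echo "DDNS service returned no IPv4 address"
        return 2
    fi
    if [ "$reported" = "$resolved" ]; then
        echo "DNS is up to date ($reported)"
    else
        echo "DNS mismatch: DDNS reports $reported, DNS resolves to ${resolved:-nothing}"
        return 1
    fi
}

# Example usage (DDNS_URL is a placeholder for the service address):
#   reported=$(curl -s "$DDNS_URL/health" | jq -r .current_ipv4)
#   resolved=$(dig +short charliehub.net A | head -n 1)
#   compare_ips "$reported" "$resolved"
```

A mismatch shortly after an IP change may just be DNS caching; a persistent mismatch points at the DDNS service.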

Cross-Site Connectivity (UK ↔ France)¶

Check VPN tunnel:

# From UK, ping France
ping -c 3 REDACTED_IP

# From France, ping UK
ssh root@REDACTED_IP ping -c 3 REDACTED_IP

If the tunnel is down, check the UniFi controller for the SD-WAN status.

Service Issues¶

Domain Manager Not Responding¶

# Check container
ssh root@REDACTED_IP docker ps | grep domain_manager

# Check logs
ssh root@REDACTED_IP docker logs charliehub_domain_manager_v3 --tail 50

# Restart
ssh root@REDACTED_IP 'cd /opt/charliehub/domain-manager && docker compose restart domain-manager-v3'

PostgreSQL Connection Failed¶

# Check container status
pct status 1112

# Check PostgreSQL service
pct exec 1112 -- systemctl status postgresql

# Test connection (run as the postgres OS user so peer authentication succeeds)
pct exec 1112 -- su - postgres -c "psql -c 'SELECT 1'"

# Restart PostgreSQL
pct exec 1112 -- systemctl restart postgresql

SSH Bastion Not Working¶

Symptoms: Cannot SSH via port 2222

# Check SSHPiper container
ssh root@REDACTED_IP docker ps | grep sshpiper

# Check logs
ssh root@REDACTED_IP docker logs charliehub_sshpiper --tail 20

# Regenerate routes for a VM
setup-vm-ssh <vmid>

Authelia (SSO) Not Working¶

# Check container
ssh root@REDACTED_IP docker ps | grep authelia

# Check health
curl http://REDACTED_IP:9091/api/health

# Check logs
ssh root@REDACTED_IP docker logs charliehub_authelia --tail 50

# Restart
ssh root@REDACTED_IP 'cd /opt/charliehub/authelia && docker compose restart'

Storage Issues¶

Ceph Pool Degraded¶

# Check pool status
ceph -s

# If a disk has failed, identify the affected OSD
ceph health detail
ceph osd tree

# Check disk health
smartctl -a /dev/sdX
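
As a rough triage aid, the first word of `ceph health` output (`HEALTH_OK`, `HEALTH_WARN`, or `HEALTH_ERR`) can be mapped to a next step. This is a sketch only: `ceph_next_step` is a hypothetical name and the suggested actions are general guidance, not a site-specific procedure.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health` output on stdin and suggest
# what to look at next, based on the leading status word.
ceph_next_step() {
    read -r status _rest || true
    case "$status" in
        HEALTH_OK)   echo "healthy: no action needed" ;;
        HEALTH_WARN) echo "degraded: run 'ceph health detail' and 'ceph osd tree'" ;;
        HEALTH_ERR)  echo "error: check failed OSDs and disk health with smartctl" ;;
        *)           echo "unknown status: ${status:-empty}" ;;
    esac
}

# Example usage:
#   ceph health | ceph_next_step
```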

Disk Space Full¶

# Check disk usage
df -h

# Find large files
du -sh /* 2>/dev/null | sort -h | tail -10

# Old snapshots are cleaned up automatically by ceph-snapshot-daily.sh

# List the oldest vzdump backups (candidates for manual removal)
ls -lhtr /mnt/backup-storage/dump/ | head -20
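
When several filesystems are mounted, it can help to list only those above a usage threshold. This is a minimal sketch that parses `df -P` output; `full_filesystems` is a hypothetical name, the 90% default is an arbitrary assumption, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P` output on stdin and print
# "<mount point> <use%>" for each filesystem at or above the threshold.
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the percent sign before comparing
        if (use + 0 >= limit) print $6 " " $5
    }'
}

# Example usage:
#   df -P | full_filesystems 85
```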

Container Cannot Start - Storage Error¶

# Check storage status
pvesm status

# List RBD images in the Ceph pool
rbd ls ceph-pool

# Try starting with verbose output
pct start <vmid> --debug

Cluster Issues¶

Node Not in Quorum¶

# Check quorum status
pvecm status

# Check corosync
corosync-quorumtool

# If the node is isolated, you can lower the expected votes so it regains quorum
# CAUTION: only do this when the other nodes are confirmed down; otherwise you risk split-brain
pvecm expected 1
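
Before changing expected votes, it is worth confirming that the node really has lost quorum. The sketch below assumes `pvecm status` prints a `Quorate: Yes` or `Quorate: No` line, as current Proxmox VE releases do; `is_quorate` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the `pvecm status` text on stdin
# reports a quorate cluster.
is_quorate() {
    grep -Eq '^Quorate:[[:space:]]+Yes'
}

# Example usage:
#   if pvecm status | is_quorate; then
#       echo "cluster is quorate; do not touch expected votes"
#   else
#       echo "NOT quorate; investigate corosync before changing votes"
#   fi
```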

Cannot Migrate VM/CT¶

# Check if target node is available
pvecm nodes

# Check storage is available on target
ssh root@<target-node> pvesm status

# Retry the migration and read the task output for the failing step
# (add --restart if the container is running)
pct migrate <vmid> <target>

Container/VM Issues¶

Container Won't Start¶

# Check status
pct status <vmid>

# View config
pct config <vmid>

# Start with debug
pct start <vmid> --debug

# Check logs
journalctl -u pve-container@<vmid>

VM Won't Start¶

# Check status
qm status <vmid>

# View config
qm config <vmid>

# Start the VM; errors are printed to the terminal
qm start <vmid>

# Show the QEMU command line Proxmox would run
qm showcmd <vmid> --pretty

# Check logs from the current boot for messages mentioning the VM ID
journalctl -b | grep -w <vmid>

High CPU/Memory in Container¶

# Check resource usage
pct exec <vmid> -- top -bn1 | head -20

# Check running processes
pct exec <vmid> -- ps aux --sort=-%cpu | head -10

# Update resource limits
pct set <vmid> --memory 4096 --cores 4

Container Cannot Reach Remote Hosts (WireGuard/VPN)¶

Symptom: CT can ping local hosts but not remote hosts (e.g., hub2's WireGuard IP REDACTED_IP)

Cause: Wrong netmask (/16 instead of /24) makes the CT think remote IPs are on the local LAN

Diagnosis:

# Check CT network config
pct config <vmid> | grep net0
# WRONG: ip=10.44.1.x/16  (thinks all 10.44.x.x is local)
# RIGHT: ip=10.44.1.x/24  (only 10.44.1.x is local)

# Check ARP failures (sign of wrong netmask)
pct exec <vmid> -- ip neigh show | grep FAILED
# If you see remote IPs with "FAILED", the CT is trying to ARP for them locally

# Check routing
pct exec <vmid> -- ip route show
# Gateway should be on same subnet (e.g., REDACTED_IP for 10.44.1.x/24)

Fix:

# Correct the netmask and gateway
pct set <vmid> -net0 name=eth0,bridge=vmbr0,gw=REDACTED_IP,hwaddr=<MAC>,ip=10.44.1.x/24,type=veth
pct reboot <vmid>

# Verify connectivity
pct exec <vmid> -- ping -c 2 REDACTED_IP  # hub2 WireGuard IP

Note: All CTs on the 10.44.1.x subnet should use a /24 netmask and the REDACTED_IP gateway.
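
The effect of the netmask can be checked with a little arithmetic: two addresses are treated as on-link when they agree on the first N bits for a /N prefix. This is a minimal sketch in which the addresses are hypothetical examples, not the real cluster IPs, and `ip_to_int` and `same_subnet` are illustrative helper names.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Print "local" if both addresses fall in the same network for the given
# prefix length (the host would ARP for the peer), otherwise "routed"
# (the host would send the packet to its gateway).
same_subnet() {
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    if [ $(( a & mask )) -eq $(( b & mask )) ]; then
        echo local
    else
        echo routed
    fi
}

# Hypothetical CT at 10.44.1.20 talking to a remote host at 10.44.2.1:
same_subnet 10.44.1.20 10.44.2.1 16   # prints "local"  (wrong: CT ARPs for it)
same_subnet 10.44.1.20 10.44.2.1 24   # prints "routed" (right: goes via gateway)
```

With /16 the CT tries to reach the remote address directly on the LAN, which matches the `FAILED` neighbour entries described above; with /24 the traffic is handed to the gateway.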

Backup Issues¶

Snapshot Script Failed¶

# Check logs
tail -100 /var/log/ceph-snapshots.log

# Check Ceph status
ceph -s

# Run manually with shell tracing enabled (assumes the script is a bash script)
bash -x /root/bin/ceph-snapshot-daily.sh

Vzdump Failed¶

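# Check the most recent vzdump logs (one file per guest, e.g. qemu-<vmid>.log or lxc-<vmid>.log)
ls -lt /var/log/vzdump/ | head
tail -n 50 /var/log/vzdump/*-<vmid>.log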

# Check storage space
df -h /mnt/backup-storage

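# Check whether the VM is locked (prints a "lock:" line if so)
qm config <vmid> | grep '^lock'

# Remove a stale lock, but only after confirming no backup is still running
qm unlock <vmid>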

Emergency Procedures¶

All Services Down - Quick Recovery¶

Check hub2 services:

ssh hub2 sudo docker ps
ssh hub2 sudo docker compose -f /opt/charliehub/docker-compose.yml ps

If Traefik is down:

ssh hub2 'sudo docker compose -f /opt/charliehub/docker-compose.yml restart traefik'

If all hub2 containers are down:

ssh hub2 'cd /opt/charliehub && sudo docker compose up -d'

Verify services:

curl -s https://docs.charliehub.net > /dev/null && echo "OK" || echo "FAIL"
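
To verify more than one endpoint after a recovery, the same check can be looped over a list of URLs. This is a sketch only: `verify_urls` is a hypothetical helper, the 10-second timeout is an arbitrary assumption, and the URLs in the usage comment are the two that appear on this page.

```shell
#!/bin/sh
# Hypothetical helper: print OK/FAIL for each URL given and return
# non-zero if any request fails (HTTP errors count as failures via -f).
verify_urls() {
    failed=0
    for url in "$@"; do
        if curl -fsS --max-time 10 -o /dev/null "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage:
#   verify_urls https://charliehub.net https://docs.charliehub.net
```

Because `-f` makes curl fail on HTTP 4xx/5xx responses, a page that sits behind a login and answers with an error status will show as `FAIL` even when the service is up.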

Cannot Access Any Node¶

Check physical and network connectivity.
Try console access via IPMI/iDRAC, if available.
Check the UniFi controller for network issues.
If the whole site is down, access px5 in France.

See Disaster Recovery for full procedures.