Troubleshooting¶
Common issues and solutions for the CharlieHub cluster.
Quick Diagnostics¶
# Cluster status
pvecm status
# All VMs/CTs status
infra list
# Check key services (all on hub2)
ssh hub2 sudo docker ps
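The three checks above can be wrapped in a small helper that reports which step failed. This is a minimal sketch: `run_check` is a hypothetical helper name, not part of the cluster tooling, and it only hides command output and prints a verdict.

```bash
#!/bin/sh
# run_check LABEL COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND quietly and prints "OK" or "FAIL"
# next to LABEL. Returns the command's exit status on failure.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'OK    %s\n' "$label"
    else
        status=$?
        printf 'FAIL  %s\n' "$label"
        return "$status"
    fi
}

# Usage on a cluster node (commands taken from the list above):
#   run_check "cluster status" pvecm status
#   run_check "VM/CT list"     infra list
#   run_check "hub2 services"  ssh hub2 sudo docker ps
```

Re-run any failing command by hand to see its full output.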
Network Issues¶
Cannot Access Web Services¶
Symptoms: https://charliehub.net is not responding
Check Traefik:
ssh root@REDACTED_IP docker ps | grep traefik
ssh root@REDACTED_IP docker logs traefik_prod --tail 20
Restart Traefik:
ssh root@REDACTED_IP 'cd /opt/charliehub/traefik-prod && docker compose restart'
DNS Not Resolving¶
Check DDNS service:
curl http://REDACTED_IP:5000/health
ssh root@REDACTED_IP docker logs charliehub_ddns_v2 --tail 20
Check current public IP:
curl -s http://REDACTED_IP:5000/health | jq .current_ipv4
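To tell whether the DDNS service and public DNS agree, the reported address can be compared with what a resolver returns. The sketch below assumes that `charliehub.net` is the record the DDNS service updates and that `dig` is installed. `ddns_in_sync` is a hypothetical helper name.

```bash
#!/bin/sh
# ddns_in_sync REPORTED_IP RESOLVED_IP
# Hypothetical helper: compares the IPv4 address reported by the DDNS
# health endpoint with the address DNS currently resolves to.
# Returns 0 if equal, 1 on mismatch, 2 if the service reported nothing.
ddns_in_sync() {
    if [ -z "$1" ] || [ "$1" = "null" ]; then
        echo "DDNS service reported no IPv4 address"
        return 2
    fi
    if [ "$1" = "$2" ]; then
        echo "in sync: $1"
    else
        echo "mismatch: DDNS reports $1, DNS resolves to ${2:-nothing}"
        return 1
    fi
}

# Usage (the record name is an assumption):
#   reported=$(curl -s http://REDACTED_IP:5000/health | jq -r .current_ipv4)
#   resolved=$(dig +short A charliehub.net | tail -n 1)
#   ddns_in_sync "$reported" "$resolved"
```

A mismatch shortly after an IP change may only reflect DNS caching, so it is worth re-checking after the record's TTL has passed.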
Cross-Site Connectivity (UK ↔ France)¶
Check VPN tunnel:
# From UK, ping France
ping -c 3 REDACTED_IP
# From France, ping UK
ssh root@REDACTED_IP ping -c 3 REDACTED_IP
If the tunnel is down, check the UniFi controller for SD-WAN status.
Service Issues¶
Domain Manager Not Responding¶
# Check container
ssh root@REDACTED_IP docker ps | grep domain_manager
# Check logs
ssh root@REDACTED_IP docker logs charliehub_domain_manager_v3 --tail 50
# Restart
ssh root@REDACTED_IP 'cd /opt/charliehub/domain-manager && docker compose restart domain-manager-v3'
PostgreSQL Connection Failed¶
# Check container status
pct status 1112
# Check PostgreSQL service
pct exec 1112 -- systemctl status postgresql
# Test connection
pct exec 1112 -- psql -U postgres -c "SELECT 1"
# Restart PostgreSQL
pct exec 1112 -- systemctl restart postgresql
SSH Bastion Not Working¶
Symptoms: Cannot SSH via port 2222
# Check SSHPiper container
ssh root@REDACTED_IP docker ps | grep sshpiper
# Check logs
ssh root@REDACTED_IP docker logs charliehub_sshpiper --tail 20
# Regenerate routes for a VM
setup-vm-ssh <vmid>
Authelia (SSO) Not Working¶
# Check container
ssh root@REDACTED_IP docker ps | grep authelia
# Check health
curl http://REDACTED_IP:9091/api/health
# Check logs
ssh root@REDACTED_IP docker logs charliehub_authelia --tail 50
# Restart
ssh root@REDACTED_IP 'cd /opt/charliehub/authelia && docker compose restart'
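Rather than grepping for one container at a time, the running container names can be compared with an expected list in one pass. This is a sketch only: `missing_containers` is a hypothetical helper, and the expected names are the ones used in the commands above.

```bash
#!/bin/sh
# missing_containers NAME...
# Hypothetical helper: reads running container names on stdin (one per
# line) and prints each expected NAME that is not among them.
# Succeeds only if nothing is missing.
missing_containers() {
    running=$(cat)
    missing=0
    for name in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string
        if ! printf '%s\n' "$running" | grep -qxF -- "$name"; then
            echo "$name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   ssh hub2 "sudo docker ps --format '{{.Names}}'" |
#       missing_containers traefik_prod charliehub_ddns_v2 \
#           charliehub_domain_manager_v3 charliehub_sshpiper charliehub_authelia
```

Any name it prints is a container to investigate with the matching section above.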
Storage Issues¶
Ceph Pool Degraded¶
# Check pool status
ceph -s
# If a disk failed, find the affected OSD
ceph health detail
ceph osd tree
# Check disk health
smartctl -a /dev/sdX
Disk Space Full¶
# Check disk usage
df -h
# Find large files
du -sh /* 2>/dev/null | sort -h | tail -10
# Old snapshots are cleaned automatically by ceph-snapshot-daily.sh
# List vzdump backups to find old ones to remove
ls -la /mnt/backup-storage/dump/ | head -20
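To spot nearly full filesystems without reading the whole `df` table, the output can be filtered by usage percentage. This is a minimal sketch: `full_filesystems` is a hypothetical helper, and it assumes mount points contain no spaces.

```bash
#!/bin/sh
# full_filesystems THRESHOLD
# Hypothetical helper: reads `df -P` output on stdin and prints
# "MOUNTPOINT USE%" for every filesystem at or above THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5               # capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```

The `-P` flag keeps each filesystem on one line, which makes the column positions reliable.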
Container Cannot Start - Storage Error¶
# Check storage status
pvesm status
# List RBD images in the Ceph pool
rbd ls ceph-pool
# Try starting with verbose output
pct start <vmid> --debug
Cluster Issues¶
Node Not in Quorum¶
# Check quorum status
pvecm status
# Check corosync
corosync-quorumtool
# If the node is isolated, you may need to lower the expected votes
pvecm expected 1 # CAUTION: forces quorum on this node; only use when the other nodes are confirmed down, or you risk split-brain
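Before changing expected votes, it helps to confirm the quorum state in a scriptable way. The sketch below relies on the `Quorate:` line that `pvecm status` prints. `is_quorate` is a hypothetical helper name.

```bash
#!/bin/sh
# is_quorate
# Hypothetical helper: reads `pvecm status` output on stdin and succeeds
# only if it contains a "Quorate: Yes" line.
is_quorate() {
    awk '$1 == "Quorate:" { found = 1; ok = ($2 == "Yes") }
         END { exit !(found && ok) }'
}

# Usage:
#   if pvecm status | is_quorate; then
#       echo "cluster is quorate"
#   else
#       echo "NOT quorate"
#   fi
```

If the `Quorate:` line is missing, for example because `pvecm` itself failed, the helper treats the node as not quorate.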
Cannot Migrate VM/CT¶
# Check if target node is available
pvecm nodes
# Check storage is available on target
ssh root@<target-node> pvesm status
# Try with debug
pct migrate <vmid> <target> --debug
Container/VM Issues¶
Container Won't Start¶
# Check status
pct status <vmid>
# View config
pct config <vmid>
# Start with debug
pct start <vmid> --debug
# Check logs
journalctl -u pve-container@<vmid>
VM Won't Start¶
# Check status
qm status <vmid>
# View config
qm config <vmid>
# Start, then inspect the generated QEMU command line if it fails
qm start <vmid>
qm showcmd <vmid> --pretty
# Check logs
journalctl -u qemu-server@<vmid>
High CPU/Memory in Container¶
# Check resource usage
pct exec <vmid> -- top -bn1 | head -20
# Check running processes
pct exec <vmid> -- ps aux --sort=-%cpu | head -10
# Update resource limits
pct set <vmid> --memory 4096 --cores 4
Container Cannot Reach Remote Hosts (WireGuard/VPN)¶
Symptom: CT can ping local hosts but not remote hosts (e.g., hub2's WireGuard IP REDACTED_IP)
Cause: Wrong netmask (/16 instead of /24) makes the CT think remote IPs are on the local LAN
Diagnosis:
# Check CT network config
pct config <vmid> | grep net0
# WRONG: ip=10.44.1.x/16 (thinks all 10.44.x.x is local)
# RIGHT: ip=10.44.1.x/24 (only 10.44.1.x is local)
# Check ARP failures (sign of wrong netmask)
pct exec <vmid> -- ip neigh show | grep FAILED
# If you see remote IPs with "FAILED", the CT is trying to ARP for them locally
# Check routing
pct exec <vmid> -- ip route show
# Gateway should be on same subnet (e.g., REDACTED_IP for 10.44.1.x/24)
Fix:
# Correct the netmask and gateway
pct set <vmid> -net0 name=eth0,bridge=vmbr0,gw=REDACTED_IP,hwaddr=<MAC>,ip=10.44.1.x/24,type=veth
pct reboot <vmid>
# Verify connectivity
pct exec <vmid> -- ping -c 2 REDACTED_IP # hub2 WireGuard IP
Note: All CTs on the 10.44.1.x subnet should use /24 netmask and REDACTED_IP gateway.
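The netmask check described above can be automated by extracting the prefix length from the `net0` line of `pct config`. This is a sketch under the stated /24 convention. `net0_prefix` and `check_netmask` are hypothetical helper names.

```bash
#!/bin/sh
# net0_prefix
# Hypothetical helper: reads `pct config <vmid>` output on stdin and
# prints the prefix length of the static IPv4 address on net0 (e.g. "24").
# Prints nothing if net0 has no static IPv4 address (e.g. ip=dhcp).
net0_prefix() {
    sed -n 's/^net0:.*[,[:space:]]ip=[0-9.]*\/\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# check_netmask
# Reads the same input and reports whether the prefix is the expected /24.
check_netmask() {
    prefix=$(net0_prefix)
    case $prefix in
        24) echo "OK: /24" ;;
        "") echo "no static IPv4 address found on net0"; return 2 ;;
        *)  echo "WRONG: /$prefix (expected /24)"; return 1 ;;
    esac
}

# Usage:
#   pct config <vmid> | check_netmask
```

Running it against every CT on the subnet is a quick way to find other containers with the same misconfiguration.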
Backup Issues¶
Snapshot Script Failed¶
# Check logs
tail -100 /var/log/ceph-snapshots.log
# Check Ceph status
ceph -s
# Run manually with debug
/root/bin/ceph-snapshot-daily.sh
Vzdump Failed¶
# Check vzdump logs
cat /var/log/vzdump/vzdump-*.log
# Check storage space
df -h /mnt/backup-storage
# Check whether the VM is locked
qm config <vmid> | grep lock
# Remove a stale lock (only if no backup is still running)
qm unlock <vmid>
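Printing every vzdump log at once can be noisy, so it may help to list only the logs that recorded a failure. The sketch assumes that failed runs write lines containing `ERROR` to their log. `failed_vzdump_logs` is a hypothetical helper name.

```bash
#!/bin/sh
# failed_vzdump_logs [DIR]
# Hypothetical helper: prints the vzdump log files under DIR
# (default /var/log/vzdump) that contain the string "ERROR".
# Assumption: failed backups log at least one line containing "ERROR".
failed_vzdump_logs() {
    dir=${1:-/var/log/vzdump}
    grep -l 'ERROR' "$dir"/vzdump-*.log 2>/dev/null
}

# Usage:
#   failed_vzdump_logs
#   failed_vzdump_logs /var/log/vzdump | xargs -r tail -n 20
```

The helper returns a non-zero status when no log contains an error, which makes it usable in a conditional.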
Emergency Procedures¶
All Services Down - Quick Recovery¶
- Check hub2 services:
  ssh hub2 sudo docker ps
  ssh hub2 sudo docker compose -f /opt/charliehub/docker-compose.yml ps
- If Traefik is down:
  ssh hub2 'sudo docker compose -f /opt/charliehub/docker-compose.yml restart traefik'
- If all hub2 containers are down:
  ssh hub2 'cd /opt/charliehub && sudo docker compose up -d'
- Verify services:
  curl -s https://docs.charliehub.net > /dev/null && echo "OK" || echo "FAIL"
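The final verification step can be extended to several URLs and made to fail on HTTP errors as well as connection errors. This is a minimal sketch: `verify_urls` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```bash
#!/bin/sh
# verify_urls URL...
# Hypothetical helper: requests each URL and prints "OK" or "FAIL".
# curl -f makes HTTP 4xx/5xx responses count as failures, which plain
# `curl -s` does not. Returns 1 if any URL failed.
verify_urls() {
    failed=0
    for url in "$@"; do
        if curl -fsS --max-time 10 -o /dev/null "$url" 2>/dev/null; then
            echo "OK   $url"
        else
            echo "FAIL $url"
            failed=1
        fi
    done
    return "$failed"
}

# Usage (URLs taken from this page):
#   verify_urls https://charliehub.net https://docs.charliehub.net
```

Services behind SSO may answer with a redirect to the login page. Because redirects are not followed here, a redirect still counts as OK, so it only shows that the proxy is answering.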
Cannot Access Any Node¶
- Check physical/network connectivity
- Try console access via IPMI/iDRAC if available
- Check UniFi controller for network issues
- If the whole site has failed, access px5 in France
See Disaster Recovery for full procedures.