Dockhand Troubleshoot
Systematic debugging guide for infrastructure issues. Supports quick triage for common problems and deep diagnostic chains for complex issues.
Quick Triage Mode
For common errors, run targeted diagnostics:
HTTP 502/504 (Bad Gateway / Gateway Timeout)
1. traefik_status → Check if router exists for domain
2. dokploy_app_status → Verify app is running
3. ssh_exec "docker logs <container> --tail 50" → Check app logs
Common fixes:
- Container crashed → dokploy_redeploy
- Wrong port → Update Traefik config
- App startup failure → Check env vars
Certificate Errors (HTTPS)
1. traefik_check_certs domain="<domain>" → Check cert status
2. dns_get_record name="<domain>" → Verify DNS points to Traefik
3. ssh_exec "docker logs traefik --tail 100 | grep acme" → ACME logs
Common fixes:
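- DNS not propagated → Wait, then check with dns_check_propagation
- Rate limited → Wait for the limit to reset (1 hour for failed validations; other ACME limits can take longer), use staging CA while testing
- Wrong challenge type → Verify port 80 is open for HTTP-01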
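Alongside the certificate checks above, it can help to confirm how long a certificate has left before it expires. The following is a minimal sketch, assuming the expiry date is available in the OpenSSL text format (the date portion printed by `openssl x509 -noout -enddate`). The helper name `days_until_expiry` is illustrative and not part of Dockhand.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the number of days left before a certificate expires.

    `not_after` uses the OpenSSL text format, e.g. "Jan  5 09:34:43 2030 GMT".
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    current = time.time() if now is None else now
    return (expires_at - current) / 86400  # 86400 seconds per day
```

A small or negative result suggests renewal is failing, which points back to the ACME logs in step 3.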
Cloudflare 521 (Origin Down)
1. ssh_exec "curl -I localhost:<port>" on platform-core → Test local
2. traefik_status → Verify Traefik running
3. ssh_exec "docker ps | grep traefik" → Container status
Common fixes:
- Traefik not running → Restart via Dokploy
- Firewall blocking → Check Hetzner firewall rules
- Network misconfiguration → Verify dokploy-network
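A 521 usually means nothing accepted the connection at the origin, so a plain TCP check can separate "nothing is listening" from HTTP-level errors. The sketch below is one way to do that with the Python standard library; `port_is_open` is an illustrative helper, not a Dockhand tool, and it is assumed to run on the origin host itself.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False
```

If the port is open locally but Cloudflare still reports 521, the firewall and network fixes above are the more likely candidates.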
App Not Starting
1. dokploy_app_status → Get deployment status
2. ssh_exec "docker logs <container> --tail 100" → Startup logs
3. ssh_exec "docker inspect <container>" → Check config
Common fixes:
- Missing env var → Add via Dokploy UI
- Volume permission → Fix ownership
- Resource exhaustion → Run check_resource_thresholds
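The `docker inspect` output in step 3 is verbose, and only a few fields matter for startup failures. The sketch below pulls out those fields and reports missing environment variables. It assumes the raw JSON from `docker inspect <container>` is passed in as a string; the function name and the list of required variables are hypothetical and supplied by the caller.

```python
import json


def summarize_inspect(inspect_json: str, required_env: list[str]) -> dict:
    """Summarize the `docker inspect` fields most relevant to startup failures."""
    container = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    state = container["State"]
    # Config.Env is a list of "KEY=value" strings; keep only the keys
    present = {entry.split("=", 1)[0] for entry in container["Config"].get("Env") or []}
    return {
        "status": state["Status"],            # e.g. "running", "restarting", "exited"
        "exit_code": state["ExitCode"],
        "oom_killed": state["OOMKilled"],     # True suggests resource exhaustion
        "restart_count": container["RestartCount"],
        "missing_env": [name for name in required_env if name not in present],
    }
```

A non-empty `missing_env` maps to the first fix above, while `oom_killed` being true points toward resource exhaustion.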
Deep Diagnostic Chain
For complex issues, run the full diagnostic sequence:
Step 1: Infrastructure Health
check_resource_thresholds → All hosts CPU/memory/disk
monitoring_health → Prometheus/Grafana/Loki status
Step 2: Edge Layer (Traefik)
traefik_status → All routers and services
traefik_check_certs → All certificate status
ssh_exec "docker logs traefik --tail 200" → Recent logs
Step 3: Orchestration Layer (Dokploy)
dokploy_list_apps → All managed applications
dokploy_app_status app_id=<id> → Specific app details
Step 4: Container Layer (Docker)
docker_state host="<host>" → Full container inventory
ssh_exec "docker ps -a" → Include stopped containers
ssh_exec "docker logs <container>" → Application logs
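On a busy host, the container inventory is easier to scan when only the unhealthy entries are shown. The sketch below filters for containers that are not running. It assumes the listing was produced with `docker ps -a --format '{{json .}}'` (one JSON object per line) rather than the default table output; `non_running_containers` is an illustrative name.

```python
import json


def non_running_containers(ps_output: str) -> list[tuple[str, str]]:
    """Return (name, status) for every container whose state is not "running".

    Expects one JSON object per line, as printed by:
        docker ps -a --format '{{json .}}'
    """
    results = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if row.get("State") != "running":
            results.append((row["Names"], row["Status"]))
    return results
```

Each name returned here is a candidate for the `docker logs <container>` call in the same step.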
Step 5: Network Layer
ssh_exec "docker network ls" → Network inventory
ssh_exec "docker network inspect dokploy-network" → Connectivity
dns_list_records → DNS configuration
dns_check_propagation domain="<domain>" → External resolution
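A common connectivity question at this layer is whether both Traefik and the application container are attached to the same network. The sketch below answers that from the JSON printed by `docker network inspect dokploy-network`, as seen from the host where the command runs. The function names and the list of expected container names are assumptions for illustration.

```python
import json


def containers_on_network(network_inspect_json: str) -> set[str]:
    """Return the names of containers attached to the inspected network."""
    network = json.loads(network_inspect_json)[0]  # the command prints a JSON array
    attached = network.get("Containers") or {}     # maps container ID -> endpoint details
    return {info["Name"] for info in attached.values()}


def missing_from_network(network_inspect_json: str, expected: list[str]) -> list[str]:
    """Return the expected container names that are not attached to the network."""
    names = containers_on_network(network_inspect_json)
    return [name for name in expected if name not in names]
```

If the application container appears in the missing list, Traefik cannot route to it even when the app itself is healthy.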
Step 6: Application Layer
ssh_exec "curl -I http://localhost:<port>" → Local health
ssh_exec "curl -I https://<domain>" → External health
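Comparing the two results narrows down which layer is at fault: if the app fails locally, the proxy is not the problem, and if it answers locally but not through the domain, the edge layer is the place to look. The sketch below encodes that reasoning. Treating any status below 500 as healthy is a simplifying assumption, and both function names are illustrative.

```python
from typing import Optional


def status_code(curl_head_output: str) -> Optional[int]:
    """Extract the first HTTP status code from `curl -I` output, or None if absent."""
    for line in curl_head_output.splitlines():
        parts = line.split()
        # Status lines look like "HTTP/1.1 200 OK" or "HTTP/2 200"
        if len(parts) >= 2 and parts[0].startswith("HTTP/") and parts[1].isdigit():
            return int(parts[1])
    return None  # e.g. curl could not connect at all


def localize_fault(local: Optional[int], external: Optional[int]) -> str:
    """Suggest which layer to inspect, given local and external status codes."""
    def healthy(code: Optional[int]) -> bool:
        return code is not None and code < 500  # assumption: 5xx or no answer is unhealthy

    if not healthy(local):
        return "application"  # fails even without the proxy in the path
    if not healthy(external):
        return "edge"         # app answers locally: check Traefik, DNS, certificates
    return "none"
```

For example, a local 200 combined with an external 502 returns "edge", which leads back to Step 2.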
Diagnostic Output Format
Present findings in a structured format:
=== Troubleshooting Report ===
Issue: <user-reported problem>
Timestamp: <when started>

## Quick Checks
- [ ] Host connectivity: OK
- [ ] Traefik status: OK
- [ ] App status: FAILED - container restarting
- [ ] DNS resolution: OK
- [ ] Certificate: OK

## Root Cause
Container `ghost` in restart loop due to missing DATABASE_URL

## Recommended Fix
1. Add DATABASE_URL to Dokploy environment variables
2. Redeploy application
3. Verify health post-deploy

## Commands to Execute
dokploy_redeploy app_id="xxx" confirmed=true
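To keep reports consistent between runs, the layout above can be generated from structured data instead of being written by hand. The following is a rough sketch of such a renderer. The `Report` class and its field names are hypothetical; only the section titles and line formats are taken from the template.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    issue: str
    timestamp: str
    checks: dict[str, str]  # check name -> result text, in display order
    root_cause: str
    fixes: list[str] = field(default_factory=list)
    commands: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report in the section layout of the template."""
        lines = [
            "=== Troubleshooting Report ===",
            f"Issue: {self.issue}",
            f"Timestamp: {self.timestamp}",
            "",
            "## Quick Checks",
        ]
        lines += [f"- [ ] {name}: {result}" for name, result in self.checks.items()]
        lines += ["", "## Root Cause", self.root_cause, "", "## Recommended Fix"]
        lines += [f"{i}. {fix}" for i, fix in enumerate(self.fixes, start=1)]
        lines += ["", "## Commands to Execute", *self.commands]
        return "\n".join(lines)
```

Calling `render()` on a populated `Report` returns a single string that can be printed or attached to an escalation.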
Escalation Path
If quick triage doesn't resolve the issue:
- Run full diagnostic chain
- Check Grafana dashboards for historical data
- Review Loki logs for patterns
- Check recent changes (git log, Dokploy activity)
- Escalate to manual SSH investigation
Integration with dh-deploy-validator
After fixing an issue, trigger the dh-deploy-validator agent to confirm the resolution and guard against regressions.