Platform Health Check

Use this skill to verify the complete health of the nemotron-v3 platform. Don't mark health checks complete until ALL services are healthy.

Quick Health Check

Run these commands in sequence. ALL must pass before declaring the platform healthy:

bash

# 1. Docker Compose services - ALL should be "Up" and "healthy"
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"

# 2. Prometheus targets - ALL should show health: "up"
curl -s localhost:9090/api/v1/targets 2>/dev/null | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' | sort

# 3. API health endpoint
curl -s localhost:8000/api/health | jq

# 4. Redis connectivity
docker compose -f docker-compose.prod.yml exec -T redis redis-cli ping

# 5. PostgreSQL connectivity
docker compose -f docker-compose.prod.yml exec -T postgres pg_isready

Healthy State Definition

The platform is HEALTHY when ALL of the following are true:

Component	Healthy State
Docker Compose	All services: `Up`, `healthy` or `running`
Prometheus	All targets: `health: "up"`
API	`/api/health` returns `{"status": "healthy"}`
Redis	`PONG` response
PostgreSQL	`accepting connections`
GPU Services	YOLO26, Nemotron responding (if deployed)

Troubleshooting Workflow

If any check fails, follow this sequence:

1. Service Not Running

bash

# Check logs for the failing service
docker compose -f docker-compose.prod.yml logs --tail=100 <service-name>

# Restart the specific service
docker compose -f docker-compose.prod.yml restart <service-name>

# Re-verify
docker compose -f docker-compose.prod.yml ps

2. Prometheus Target Down

bash

# Check which targets are down
curl -s localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, lastError: .lastError}'

# Common fixes:
# - Service not exposing metrics: check if /metrics endpoint exists
# - Wrong port in prometheus.yml: verify port matches service
# - Network issue: check if prometheus can reach the service

3. API Health Failing

bash

# Check API logs
docker compose -f docker-compose.prod.yml logs --tail=100 backend

# Check if API container is running
docker compose -f docker-compose.prod.yml ps backend

# Check API dependencies (Redis, Postgres)
docker compose -f docker-compose.prod.yml exec backend python -c "from backend.core.database import engine; print('DB OK')"

4. Database Issues

bash

# Check PostgreSQL logs
docker compose -f docker-compose.prod.yml logs --tail=50 postgres

# Check Redis logs
docker compose -f docker-compose.prod.yml logs --tail=50 redis

# Verify connections from backend
docker compose -f docker-compose.prod.yml exec backend python -c "
import redis
r = redis.Redis(host='redis', port=6379)
print('Redis:', r.ping())
"

Post-Fix Verification

After ANY fix, always re-run the full health check:

bash

# Full verification loop - run ALL checks again
docker compose -f docker-compose.prod.yml ps --format "table {{.Name}}\t{{.Status}}\t{{.Health}}"
curl -s localhost:9090/api/v1/targets 2>/dev/null | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' | sort
curl -s localhost:8000/api/health | jq

Completion Criteria

DO NOT mark the health check complete until:

• docker compose ps shows ALL services as Up/healthy
• ALL Prometheus targets show health: "up"
• API health endpoint returns success
• No ERROR level logs in recent output
• Any issues found have been FIXED and RE-VERIFIED

If you cannot achieve healthy state, document:

•Which specific checks are failing
•Error messages from logs
•What you tried
•Recommended next steps

Never leave the platform in a "confused" or partially-fixed state.

Default Prompt Template

When starting any infrastructure task, use this prompt pattern to ensure complete verification:

code

Fix the issue, then verify the entire stack is healthy. Don't stop until
`docker compose ps` shows all services Up and healthy, and all Prometheus
targets are up. List any remaining issues.

This ensures:

•Fix is not considered "done" at application
•Full verification loop runs automatically
•Any cascading issues are caught
•Final state is documented