Production Debugging
Systematic approach to debugging production issues in Kubernetes microservice environments.
When to Use
- Investigating HTTP 500 errors
- Debugging missing functionality (feature works locally, fails in production)
- Tracing requests across microservices
- Finding silent failures (no error, but wrong behavior)
- Diagnosing service-to-service integration issues
Debugging Methodology
Step 1: Reproduce and Identify Symptoms
# What's the user seeing?
# - HTTP 500 error on /workers page
# - No reminder notifications
# - Data saved but logs show errors
# Document the symptom precisely before diving in
Step 2: Check Logs Systematically
# Start with the failing service
kubectl logs deploy/<service-name> -n <namespace> --tail=100

# Filter for errors
kubectl logs deploy/<service-name> -n <namespace> --tail=200 | grep -i -E "(error|exception|fail|warn)"

# Check specific container in multi-container pod
kubectl logs deploy/<service-name> -n <namespace> -c <container-name> --tail=100

# Common containers:
# - main app container (e.g., "api", "web")
# - daprd (Dapr sidecar)
# - init containers (e.g., "wait-for-db")
Step 3: Trace the Request Path
For microservice issues, trace the full request path:
# 1. Frontend → API
kubectl logs deploy/web-dashboard -n taskflow --tail=50

# 2. API processing
kubectl logs deploy/taskflow-api -n taskflow --tail=100 | grep -i "endpoint-name"

# 3. API → External service (e.g., Dapr, SSO)
kubectl logs deploy/taskflow-api -n taskflow -c daprd --tail=50

# 4. Downstream service
kubectl logs deploy/notification-service -n taskflow --tail=50
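When the per-service logs are hard to correlate by eye, interleaving them by timestamp can help. The following is a minimal sketch, assuming each log was collected with `kubectl logs --timestamps` so that every line starts with a UTC RFC 3339 timestamp ending in `Z`. The helper names `parse_stamp` and `merge_logs` are hypothetical, not part of any existing tool.

```python
import re
from datetime import datetime, timezone

# Assumption: timestamps look like 2024-05-01T12:00:01.123456789Z (UTC, optional fraction).
_STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z$")


def parse_stamp(stamp: str) -> datetime:
    """Parse a UTC RFC 3339 timestamp, truncating the fraction to microseconds."""
    match = _STAMP.match(stamp)
    if match is None:
        raise ValueError(f"not a UTC RFC 3339 timestamp: {stamp!r}")
    base = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    fraction = (match.group(2) or "")[:6].ljust(6, "0")
    return base.replace(microsecond=int(fraction), tzinfo=timezone.utc)


def merge_logs(logs_by_service: dict[str, list[str]]) -> list[str]:
    """Interleave log lines from several services in chronological order."""
    entries = []
    for service, lines in logs_by_service.items():
        for line in lines:
            stamp, _, message = line.partition(" ")
            try:
                when = parse_stamp(stamp)
            except ValueError:
                continue  # skip lines without a leading timestamp (e.g. stack trace bodies)
            entries.append((when, service, message))
    entries.sort(key=lambda entry: entry[0])
    return [f"{when.isoformat()} [{service}] {message}" for when, service, message in entries]
```

The input could be built by saving the output of each command above (with `--timestamps` added) and splitting it into lines. Lines without a timestamp are dropped in this sketch, which loses multi-line stack traces; keep the raw logs at hand for those.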
Step 4: Analyze the Error
Common patterns to look for:
| Error Pattern | Likely Cause |
|---|---|
| AttributeError: 'X' has no attribute 'Y' | Model/schema mismatch |
| 404 Not Found on internal call | Wrong endpoint URL |
| greenlet_spawn has not been called | Async SQLAlchemy pattern issue |
| event_type: None | Message format/unwrapping issue |
| Times off by hours | Timezone handling bug |
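The text-based rows of the table can be turned into a rough triage script. This is a sketch only: the regular expressions are approximations of the patterns above, the names `ERROR_PATTERNS`, `classify`, and `summarize` are hypothetical, and the timezone row is left out because it does not show up as a fixed log string.

```python
import re

# Patterns approximated from the table above; the cause labels are the table's.
ERROR_PATTERNS = [
    (re.compile(r"AttributeError: .* has no attribute"), "Model/schema mismatch"),
    (re.compile(r"404 Not Found"), "Wrong endpoint URL"),
    (re.compile(r"greenlet_spawn has not been called"), "Async SQLAlchemy pattern issue"),
    (re.compile(r"event_type: None"), "Message format/unwrapping issue"),
]


def classify(line: str):
    """Return the likely cause for a log line, or None if no pattern matches."""
    for pattern, cause in ERROR_PATTERNS:
        if pattern.search(line):
            return cause
    return None


def summarize(lines):
    """Count how many lines match each likely cause."""
    counts = {}
    for line in lines:
        cause = classify(line)
        if cause is not None:
            counts[cause] = counts.get(cause, 0) + 1
    return counts
```

Feeding `summarize` the lines of a saved `kubectl logs` dump gives a quick count per likely cause. A match is only a hint about where to look first, not a diagnosis.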
Quick Commands
Check All Services Status
kubectl get pods -n taskflow
kubectl get pods -n taskflow -o wide  # With node info
Check Service Logs
# Main app logs
kubectl logs deploy/taskflow-api -n taskflow --tail=100

# Dapr sidecar logs
kubectl logs deploy/taskflow-api -n taskflow -c daprd --tail=100

# Follow logs in real-time
kubectl logs deploy/taskflow-api -n taskflow -f

# Logs from specific time
kubectl logs deploy/taskflow-api -n taskflow --since=5m
Check Pod Events
kubectl describe pod <pod-name> -n taskflow
kubectl get events -n taskflow --sort-by='.lastTimestamp'
Execute Commands in Pod
# Shell into pod
kubectl exec -it deploy/taskflow-api -n taskflow -- /bin/sh

# Run specific command
kubectl exec deploy/taskflow-api -n taskflow -- env | grep DATABASE
Common Bug Patterns
1. Model/Schema Mismatch
Symptom: AttributeError: 'Model' has no attribute 'field'
Debug:
# Find the error
kubectl logs deploy/taskflow-api -n taskflow --tail=100 | grep -i "attribute"

# Check the model definition
grep -r "class Worker" apps/api/src/
Fix: Ensure code references match actual model fields.
2. Wrong Endpoint URL
Symptom: 404 Not Found on internal service calls
Debug:
# Check what URL is being called
kubectl logs deploy/taskflow-api -n taskflow -c daprd --tail=100 | grep "404"

# Check what endpoints exist
kubectl exec deploy/taskflow-api -n taskflow -- curl localhost:8000/openapi.json | jq '.paths | keys'
Fix: Match the callback URL to what the service exposes.
3. Timezone Bugs
Symptom: Scheduled jobs fire at wrong times (hours off)
Debug:
# Check when job was scheduled vs when it should fire
kubectl logs deploy/taskflow-api -n taskflow | grep -i "scheduled"

# Compare times
# If local time 23:00 but scheduled for 23:00 UTC → timezone bug
Fix: Convert to UTC before storing/scheduling.
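As a minimal sketch of this fix, assuming the application receives a naive local datetime together with the user's timezone: attach the timezone first, then convert to UTC before storing or scheduling. The helper name `to_utc` and the fixed +05:00 offset in the example are illustrative assumptions, not taken from the application.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def to_utc(local_naive: datetime, user_tz: tzinfo) -> datetime:
    """Interpret a naive datetime as local time in user_tz and convert it to UTC."""
    if local_naive.tzinfo is not None:
        raise ValueError("expected a naive datetime; got one that already has a timezone")
    return local_naive.replace(tzinfo=user_tz).astimezone(timezone.utc)


# Example: the user picks 23:00 in a UTC+05:00 zone (an assumed offset).
user_tz = timezone(timedelta(hours=5))
due_utc = to_utc(datetime(2024, 5, 1, 23, 0), user_tz)
print(due_utc.isoformat())  # 2024-05-01T18:00:00+00:00
```

A named zone such as `zoneinfo.ZoneInfo("Europe/Berlin")` can be passed in place of the fixed offset, which also accounts for daylight saving time. Scheduling the naive 23:00 value as if it were UTC is exactly the "hours off" symptom described above.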
4. Message Format Issues
Symptom: Handler receives data but can't find expected fields
Debug:
# Add logging to see raw message
kubectl logs deploy/notification-service -n taskflow | grep -i "raw"

# Check message structure
# CloudEvent wraps payload in "data" field
Fix: Unwrap CloudEvent: event = raw.get("data", raw)
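The one-line fix can be expanded into a small helper. This is a sketch under assumptions: the function name `unwrap_event` is hypothetical, and the handling of a JSON-encoded string in `data` is a defensive guess about how the payload might arrive, not something the original fix covers.

```python
import json


def unwrap_event(raw: dict) -> dict:
    """Return the application payload, whether or not it is wrapped in a CloudEvent."""
    # Same idea as the one-line fix: prefer "data" if present, else use the message as-is.
    payload = raw.get("data", raw)
    # Assumption: "data" may arrive as a JSON-encoded string rather than an object.
    if isinstance(payload, str):
        payload = json.loads(payload)
    if not isinstance(payload, dict):
        raise TypeError(f"unexpected payload type: {type(payload).__name__}")
    return payload


wrapped = {"specversion": "1.0", "data": {"event_type": "task.created", "task_id": 7}}
print(unwrap_event(wrapped)["event_type"])  # task.created
```

Logging the raw message before unwrapping makes the `event_type: None` symptom easy to confirm: the field is present, just one level deeper than the handler expects.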
5. Async SQLAlchemy Errors
Symptom: greenlet_spawn has not been called
Debug:
# Find the line that crashes
kubectl logs deploy/notification-service -n taskflow | grep -A 20 "greenlet"
Fix: Add await session.refresh(obj) after commit before accessing attributes.
Debugging Dapr Specifically
Check Dapr Sidecar
# Dapr scheduler connection
kubectl logs deploy/taskflow-api -n taskflow -c daprd | grep -i "scheduler"

# Dapr API calls
kubectl logs deploy/taskflow-api -n taskflow -c daprd | grep "HTTP API Called"

# Dapr pub/sub
kubectl logs deploy/taskflow-api -n taskflow -c daprd | grep -i "publish"
Check Dapr Subscriptions
# What subscriptions are registered?
kubectl exec deploy/notification-service -n taskflow -- curl localhost:8001/dapr/subscribe
Test Dapr Pub/Sub
# Publish test event from inside cluster
kubectl exec deploy/taskflow-api -n taskflow -- curl -X POST \
http://localhost:3500/v1.0/publish/taskflow-pubsub/test-topic \
-H "Content-Type: application/json" \
-d '{"test": true}'
Debugging Checklist
When investigating a production issue:
- Reproduce the issue (what exactly fails?)
- Check pod status (kubectl get pods)
- Check main app logs for errors
- Check sidecar logs (daprd, etc.)
- Trace request path across services
- Identify error pattern (see table above)
- Verify fix locally before deploying
- Deploy and verify in production
CI/CD Integration
Check Deployment Status
# GitHub Actions
gh run list --limit 5

# Check specific run
gh run view <run-id>

# Watch deployment
gh run watch
Verify Deployment
# Check pod restart count (should be 0 for healthy pods)
kubectl get pods -n taskflow

# Check pod age (recent = just deployed)
kubectl get pods -n taskflow -o wide

# Verify new code is running
kubectl logs deploy/taskflow-api -n taskflow --tail=10 | head -5
Prevention
Add Logging at Key Points
logger.info("[SERVICE] Received request: %s", request_summary)
logger.info("[SERVICE] Processing: step=%s, data=%s", step, safe_data)
logger.info("[SERVICE] Completed: result=%s", result_summary)
logger.error("[SERVICE] Failed: error=%s, context=%s", error, context)
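The `safe_data` value in the calls above implies that sensitive fields are stripped before logging. One way this might be done is sketched below; the helper name `redact` and the key list `SENSITIVE_KEYS` are assumptions for illustration and should be adapted to the fields the services actually handle.

```python
# Assumed set of key names to mask; extend it for the real payloads.
SENSITIVE_KEYS = frozenset({"password", "token", "authorization", "secret"})


def redact(data, sensitive=SENSITIVE_KEYS):
    """Return a copy of data with values of sensitive keys replaced by '***'."""
    if isinstance(data, dict):
        return {
            key: "***" if str(key).lower() in sensitive else redact(value, sensitive)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item, sensitive) for item in data]
    return data


safe_data = redact({"user": "ada", "password": "hunter2"})
print(safe_data)  # {'user': 'ada', 'password': '***'}
```

Key matching here is exact and case-insensitive, so a field named `api_token` would not be masked unless it is added to the set.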
Include Correlation IDs
import uuid

@router.post("/tasks")
async def create_task(request: Request):
    correlation_id = request.headers.get("X-Correlation-ID", str(uuid.uuid4()))
    logger.info("[%s] Creating task", correlation_id)
    # ... processing ...
    logger.info("[%s] Task created: %d", correlation_id, task.id)
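Passing the ID to every log call by hand is easy to forget. A possible alternative, sketched here with only the standard library, stores the ID in a `contextvars.ContextVar` and lets a logging filter stamp it on each record. The names `correlation_id_var`, `CorrelationIdFilter`, `bind_correlation_id`, and `configure_logger` are hypothetical, and the plain `dict` of headers stands in for the framework's request headers.

```python
import contextvars
import logging
import uuid

# Holds the ID for the current request; "-" is shown when none has been bound.
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True


def bind_correlation_id(headers):
    """Reuse the caller's X-Correlation-ID header or generate a new ID."""
    correlation_id = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id_var.set(correlation_id)
    return correlation_id


def configure_logger(name, stream=None):
    """Build a logger whose output is prefixed with the correlation ID."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(logging.Formatter("[%(correlation_id)s] %(message)s"))
    handler.addFilter(CorrelationIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

With this in place, calling `bind_correlation_id(request.headers)` once at the start of a handler would be enough for every later `logger.info(...)` in that request to carry the ID. To trace a request across services, the same ID also needs to be sent as `X-Correlation-ID` on outgoing calls.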
Test Error Paths
def test_handles_missing_field():
    """Ensure graceful handling of missing data."""
    response = client.post("/tasks", json={})  # Missing required field
    assert response.status_code == 422  # Not 500!