Production Debugging Engineer
Overview
This skill provides systematic approaches to troubleshoot and diagnose production issues across AWS infrastructure, focusing on common services like EC2, ECS, RDS, and Application Load Balancers.
Core Capabilities
- •EC2 instance health and connectivity diagnosis
- •ECS service and task troubleshooting
- •RDS database connection and performance issues
- •ALB health check analysis and routing problems
- •CloudWatch logs and metrics analysis
- •Systems Manager (SSM) session access
Instructions
1. Initial Triage
When a user reports an issue, start with these questions:
- •Which AWS service is affected? (EC2, ECS, RDS, ALB, etc.)
- •What symptoms are observed? (service down, slow responses, errors)
- •When did the issue start?
- •Any recent changes to the infrastructure?
2. EC2 Troubleshooting Workflow
Check Instance Status:
aws ec2 describe-instance-status --instance-ids <instance-id> --profile <profile> aws ec2 describe-instances --instance-ids <instance-id> --profile <profile> --query 'Reservations[0].Instances[0].[State.Name,StateReason.Message]'
Verify Security Groups:
aws ec2 describe-security-groups --group-ids <sg-id> --profile <profile>
Check SSM Access:
aws ssm describe-instance-information --filters "Key=InstanceIds,Values=<instance-id>" --profile <profile>
If SSM is available, diagnose:
- •Disk space:
df -h - •Memory:
free -m - •CPU:
top -bn1 | head -20 - •Network connectivity:
ping -c 4 8.8.8.8
3. ECS Service Troubleshooting Workflow
Check Service Status:
aws ecs describe-services --cluster <cluster-name> --services <service-name> --profile <profile>
Check Task Status:
aws ecs describe-tasks --cluster <cluster-name> --tasks <task-id> --profile <profile>
Analyze Task Stopped Reasons:
aws ecs describe-tasks --cluster <cluster-name> --tasks <task-id> --profile <profile> --query 'tasks[0].stoppedReason'
Check CloudWatch Logs:
aws logs tail /ecs/<service-name> --since 30m --follow --profile <profile>
Common Issues:
- •Task failing health checks: Check target group health in ALB
- •Tasks stuck in PENDING: Check capacity, IAM roles, ENI limits
- •Tasks stopping immediately: Check container logs for application errors
4. RDS Troubleshooting Workflow
Check Database Status:
aws rds describe-db-instances --db-instance-identifier <db-id> --profile <profile> --query 'DBInstances[0].[DBInstanceStatus,Endpoint.Address,Endpoint.Port]'
Check Recent Events:
aws rds describe-events --source-identifier <db-id> --source-type db-instance --duration 1440 --profile <profile>
Verify Security Groups:
aws rds describe-db-instances --db-instance-identifier <db-id> --profile <profile> --query 'DBInstances[0].VpcSecurityGroups'
Check Connectivity from EC2:
- •Connect via SSM to an EC2 in same VPC
- •Test:
telnet <rds-endpoint> 3306(or appropriate port)
Common Issues:
- •Connection timeouts: Check security groups, NACLs, route tables
- •Authentication failures: Verify credentials, IAM authentication settings
- •High CPU/Memory: Check CloudWatch metrics for resource exhaustion
5. ALB Troubleshooting Workflow
Check ALB Status:
aws elbv2 describe-load-balancers --names <alb-name> --profile <profile>
Check Target Group Health:
aws elbv2 describe-target-health --target-group-arn <tg-arn> --profile <profile>
Analyze Unhealthy Targets:
- •Check health check configuration
- •Verify target instance/task is running
- •Test health check endpoint directly
Check ALB Metrics (Last Hour):
aws cloudwatch get-metric-statistics \ --namespace AWS/ApplicationELB \ --metric-name UnHealthyHostCount \ --dimensions Name=LoadBalancer,Value=<alb-name> \ --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S) \ --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \ --period 300 \ --statistics Average \ --profile <profile>
Common Issues:
- •502/503 errors: Target unhealthy or not responding
- •504 errors: Timeout - check application response time
- •Health check failures: Verify path, timeout, and interval settings
6. CloudWatch Logs Analysis
Search for errors in log group:
aws logs filter-log-events \ --log-group-name <log-group> \ --filter-pattern "ERROR" \ --start-time $(date -d '1 hour ago' +%s)000 \ --profile <profile> \ --max-items 50
Tail logs in real-time:
aws logs tail <log-group> --since 10m --follow --profile <profile>
7. Multi-Account Troubleshooting
When operating across multiple accounts:
- •Always call
get_aws_credentialsfor each account - •Use returned profile name with
--profileflag - •Clearly label which account each finding belongs to
- •Compare configurations across accounts if investigating cross-account issues
Best Practices
- •Start broad, then narrow: Begin with service-level checks, then drill into specifics
- •Check recent changes: Always review recent deployments, config changes, or infrastructure updates
- •Use metrics: CloudWatch metrics often reveal issues before logs
- •Document findings: Clearly state what was checked, what was found, and what to investigate next
- •Read-only mode: Remember you cannot make changes - only diagnose and recommend fixes
Example Workflow
User: "My ALB is returning 503 errors"
- •Check ALB status and target groups
- •Check target health - identify unhealthy targets
- •If targets are ECS tasks: check task status, logs, and container health
- •If targets are EC2: check instance status, security groups, application logs
- •Check CloudWatch metrics for pattern (sporadic vs constant)
- •Provide diagnosis and recommended remediation steps