aws-troubleshoot

AWS service troubleshooting patterns. Applies to diagnosing problems with EC2, ECS, Lambda, CloudWatch, RDS, and similar services.

SKILL.md
--- frontmatter
name: aws-troubleshoot
description: AWS service troubleshooting patterns. Use for EC2, ECS, Lambda, CloudWatch, RDS issues.

AWS Troubleshooting Expertise

Investigation Methodology

  1. Identify the AWS resource/service involved
  2. Check resource status using describe functions
  3. Review CloudWatch logs for errors
  4. Check CloudWatch metrics for anomalies
  5. Analyze configuration for misconfigurations
  6. Synthesize the findings and recommend a fix

CloudWatch Logs Strategy

Partition First (CRITICAL)

Never dump all logs. Use aggregation queries first:

code
# Error rate over time
filter @message like /ERROR/
| stats count(*) as errors by bin(5m)

# Top error messages (alias the count so that sort can reference it)
filter @message like /Exception/
| stats count(*) as occurrences by @message
| sort occurrences desc
| limit 10

# Latency percentiles (@duration is populated on Lambda REPORT lines)
stats pct(@duration, 50) as p50, pct(@duration, 99) as p99 by bin(5m)

# Unique error types
filter @message like /ERROR/
| parse @message /(?<error_type>[\w.]+Exception)/
| stats count(*) by error_type

Query Flow

  1. Statistics first: Get error counts, distributions
  2. Identify time windows: Find when errors spiked
  3. Sample from spikes: Get specific examples
  4. Compare to baseline: Query same period yesterday/last week
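
The query flow above can be sketched in Python. This is a minimal sketch, not a definitive implementation: it assumes `logs_client` is a boto3 CloudWatch Logs client (`boto3.client("logs")`), and the helper names `baseline_windows` and `run_insights_query` are made up for illustration. It starts a query, polls until it finishes, and computes the "same period yesterday / last week" windows for the baseline step.

```python
import time
from datetime import datetime, timedelta, timezone

TERMINAL_STATES = ("Complete", "Failed", "Cancelled", "Timeout")


def baseline_windows(start, end, offsets=(timedelta(days=1), timedelta(weeks=1))):
    """Return the (start, end) window shifted back by each offset."""
    return [(start - offset, end - offset) for offset in offsets]


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Run one Logs Insights query and return its rows as dicts.

    Assumption: logs_client behaves like boto3.client("logs").
    start and end are timezone-aware datetimes.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start.timestamp()),  # epoch seconds
        endTime=int(end.timestamp()),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} still running after {max_polls} polls")

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} cells.
    return [{cell["field"]: cell["value"] for cell in row}
            for row in response["results"]]
```

Running the same aggregation query over the spike window and over each window from `baseline_windows` gives a like-for-like comparison without pulling raw log lines.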

Service-Specific Patterns

EC2 Issues

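| Symptom | First Check | Typical Cause |
| --- | --- | --- |
| Unreachable | describe_ec2_instance | Security group, stopped, status check failed |
| Performance | get_cloudwatch_metrics (CPUUtilization) | CPU exhaustion, network saturation |
| Disk full | get_cloudwatch_metrics (DiskSpaceUtilization; not a default EC2 metric, it has to be published from the instance, e.g. by the CloudWatch agent) | Logs, temp files |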

Key CloudWatch metrics for EC2:

  • CPUUtilization
  • NetworkIn, NetworkOut
  • DiskReadOps, DiskWriteOps
  • StatusCheckFailed
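
For the "Unreachable" row, the first triage step can be sketched as below. The function name `triage_instance_status` is hypothetical, and the input is assumed to be one entry of `InstanceStatuses` as returned by boto3's `ec2.describe_instance_status(..., IncludeAllInstances=True)`; the `describe_ec2_instance` tool named in this skill may return a different shape.

```python
def triage_instance_status(status):
    """Map one InstanceStatuses entry to a coarse next step.

    Assumption: `status` has the boto3 describe_instance_status shape,
    with InstanceState, SystemStatus and InstanceStatus keys.
    """
    state = status["InstanceState"]["Name"]
    if state != "running":
        return f"instance is {state}: start it or find out why it stopped"

    system = status["SystemStatus"]["Status"]
    instance = status["InstanceStatus"]["Status"]
    if system == "impaired":
        # System checks cover the underlying AWS infrastructure.
        return "system status check failed: problem with the underlying host or network"
    if instance == "impaired":
        # Instance checks cover the software and network configuration of the instance.
        return "instance status check failed: problem inside the guest OS"
    if "initializing" in (system, instance):
        return "status checks still initializing: wait and re-check"
    return "status checks pass: review security groups, NACLs and route tables next"
```

Ordering the checks from instance state to status checks to network configuration mirrors the typical causes listed in the table.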

Lambda Issues

| Symptom | First Check | Typical Cause |
| --- | --- | --- |
| Timeout | CloudWatch logs | External call slow, cold start, insufficient memory |
| Permission denied | CloudWatch logs | IAM role missing permissions |
| Memory error | CloudWatch metrics | Memory allocation too low |
| Cold starts | CloudWatch logs + metrics | Provisioned concurrency needed |

Key CloudWatch metrics for Lambda:

  • Invocations
  • Duration
  • Errors
  • Throttles
  • ConcurrentExecutions

CloudWatch Insights for Lambda:

code
# Cold start analysis
filter @type = "REPORT"
| stats avg(@initDuration) as avg_cold_start,
        count(@initDuration) as cold_starts,
        count(*) as total_invocations
        by bin(5m)

# Timeout analysis
filter @message like /Task timed out/
| stats count(*) by bin(5m)
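
When a few sampled REPORT lines are already in hand, the memory and cold-start signals can also be read directly from them. The sketch below assumes the usual tab-separated Lambda REPORT layout (`Duration`, `Billed Duration`, `Memory Size`, `Max Memory Used`, and `Init Duration` on cold starts); `parse_report` and `memory_headroom` are illustrative names, not part of any AWS library.

```python
import re

# Matches "<name>: <number> <unit>" pairs found on a Lambda REPORT line.
_REPORT_FIELD = re.compile(
    r"(Billed Duration|Init Duration|Duration|Memory Size|Max Memory Used): "
    r"([\d.]+) (ms|MB)"
)


def parse_report(line):
    """Return the numeric fields of a REPORT line, or None for other lines."""
    if not line.startswith("REPORT"):
        return None
    return {name: float(value) for name, value, _unit in _REPORT_FIELD.findall(line)}


def memory_headroom(report):
    """Fraction of configured memory left unused (0.0 means fully used)."""
    return 1.0 - report["Max Memory Used"] / report["Memory Size"]
```

A headroom close to zero points at the "Memory allocation too low" cause, and the presence of an `Init Duration` key marks the invocation as a cold start.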

ECS/Fargate Issues

| Symptom | First Check | Typical Cause |
| --- | --- | --- |
| Task failed | list_ecs_tasks | Container crash, resource limits, image pull |
| Service unhealthy | list_ecs_tasks | Health check failing, target group issues |
| Slow scaling | CloudWatch metrics | Insufficient capacity, service limits |

Investigation flow:

  1. list_ecs_tasks - See task status and health
  2. Check stopped reason in task description
  3. Review CloudWatch logs for the task
  4. Check container insights metrics
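
Step 2 of the flow above can be sketched as a small summarizer. It assumes a response shaped like boto3's `ecs.describe_tasks` output (`tasks[].stoppedReason`, `tasks[].containers[].exitCode` and `.reason`); the `list_ecs_tasks` tool in this skill may expose the same data under different keys, and `summarize_stopped_tasks` is a hypothetical name. Treating exit code 137 as a likely out-of-memory kill is a heuristic (137 is 128 + SIGKILL), not a certainty.

```python
def summarize_stopped_tasks(describe_tasks_response):
    """Collect the stop reason and container exit details of STOPPED tasks.

    Assumption: the input follows the boto3 ecs.describe_tasks response shape.
    """
    summaries = []
    for task in describe_tasks_response.get("tasks", []):
        if task.get("lastStatus") != "STOPPED":
            continue
        containers = [
            {
                "name": container.get("name"),
                "exit_code": container.get("exitCode"),
                "reason": container.get("reason"),
            }
            for container in task.get("containers", [])
        ]
        summaries.append({
            "task": task.get("taskArn"),
            "stopped_reason": task.get("stoppedReason"),
            "containers": containers,
            # Heuristic only: SIGKILL often, but not always, means OOM.
            "likely_oom": any(c["exit_code"] == 137 for c in containers),
        })
    return summaries
```

The summary narrows down which task's CloudWatch logs to open in step 3.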

RDS Issues

| Symptom | First Check | Typical Cause |
| --- | --- | --- |
| Connection refused | get_rds_instance_status | Security group, stopped, maintenance |
| Slow queries | CloudWatch metrics | CPU, IOPS, connections |
| Storage full | CloudWatch metrics | Data growth, logs, snapshots |

Key CloudWatch metrics for RDS:

  • CPUUtilization
  • DatabaseConnections
  • ReadIOPS, WriteIOPS
  • FreeStorageSpace
  • ReadLatency, WriteLatency
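
For the "Storage full" row, a rough way to turn `FreeStorageSpace` into something actionable is to extrapolate its recent trend. This is a sketch under assumptions: `cloudwatch_client` is taken to be a boto3 CloudWatch client, the helper names are invented for illustration, and the straight-line estimate ignores bursts, autoscaling storage and cleanup jobs.

```python
from datetime import datetime, timedelta, timezone


def fetch_free_storage(cloudwatch_client, db_instance_id,
                       lookback=timedelta(hours=24), period=3600):
    """Fetch average FreeStorageSpace datapoints, oldest first.

    Assumption: cloudwatch_client behaves like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeStorageSpace",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        StartTime=end - lookback,
        EndTime=end,
        Period=period,
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order.
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


def hours_until_full(datapoints):
    """Linear estimate of hours until free space hits zero, or None if not shrinking."""
    if len(datapoints) < 2:
        return None
    first, last = datapoints[0], datapoints[-1]
    elapsed_hours = (last["Timestamp"] - first["Timestamp"]).total_seconds() / 3600
    consumed = first["Average"] - last["Average"]
    if elapsed_hours <= 0 or consumed <= 0:
        return None
    return last["Average"] / (consumed / elapsed_hours)
```

A small estimate suggests acting on storage before digging further into query performance.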

Common AWS Errors

Permission Errors

code
AccessDeniedException
UnauthorizedAccess

→ Check IAM role/policy attached to the service

Throttling

code
Throttling
Rate exceeded
TooManyRequestsException

→ Implement exponential backoff with jitter; request a limit increase if throttling persists
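
A minimal sketch of exponential backoff with full jitter follows. The names `call_with_backoff` and `is_throttling_error` are illustrative, and the error check assumes exceptions that carry a botocore-style `response["Error"]["Code"]`; adjust the code set to the service in use.

```python
import random
import time

# Codes taken from the list above; extend per service as needed.
THROTTLE_CODES = {"Throttling", "ThrottlingException", "TooManyRequestsException"}


def is_throttling_error(exc):
    """True if the exception looks like a throttling response.

    Assumption: botocore-style errors expose exc.response["Error"]["Code"].
    """
    code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
    return code in THROTTLE_CODES or "Rate exceeded" in str(exc)


def call_with_backoff(fn, is_retryable, max_attempts=5, base=0.5, cap=20.0,
                      sleep=time.sleep):
    """Call fn(), retrying retryable failures with full-jitter backoff.

    The delay before retry n is uniform in [0, min(cap, base * 2**n)].
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors and on the final attempt.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Usage would look like `call_with_backoff(lambda: client.describe_instances(), is_throttling_error)`. The AWS SDKs also ship configurable retry modes, so a hand-rolled loop like this is mainly useful around calls the SDK does not already retry.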

Resource Not Found

code
ResourceNotFoundException
NoSuchEntity

→ Verify resource name, region, account
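
When the failing call references an ARN, the region and account check can be done mechanically. The sketch below relies on the standard ARN layout `arn:partition:service:region:account-id:resource`; `parse_arn` and `locate_mismatch` are hypothetical helper names.

```python
from typing import NamedTuple


class Arn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN into its fields; the resource part may itself contain colons."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return Arn(*parts[1:])


def locate_mismatch(arn, expected_region, expected_account):
    """List region/account differences between the ARN and the caller's context."""
    parsed = parse_arn(arn)
    problems = []
    # Global services such as IAM leave the region field empty, so skip blanks.
    if parsed.region and parsed.region != expected_region:
        problems.append(f"region {parsed.region} != {expected_region}")
    if parsed.account_id and parsed.account_id != expected_account:
        problems.append(f"account {parsed.account_id} != {expected_account}")
    return problems
```

An empty list means region and account line up, which leaves the resource name itself as the thing to verify.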