AWS CloudWatch Skill
Set up comprehensive monitoring and alerting for AWS resources.
Quick Reference
| Attribute | Value |
|---|---|
| AWS Service | CloudWatch |
| Complexity | Medium |
| Est. Time | 15-30 min |
| Prerequisites | Resources to monitor |
Parameters
Required
| Parameter | Type | Description | Validation |
|---|---|---|---|
| namespace | string | Metric namespace | AWS/* or custom |
| metric_name | string | Metric name | Valid metric |
| resource_id | string | Resource identifier | Valid ARN or ID |
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| period | int | 300 | Evaluation period (seconds) |
| statistic | string | Average | Average, Sum, Minimum, Maximum, SampleCount, or a percentile such as p99 |
| threshold | float | varies | Alert threshold |
| evaluation_periods | int | 3 | Consecutive periods |
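The parameters above can be checked locally before any API call is made. The following is a minimal sketch, not part of the skill itself: `validate_alarm_params` is a hypothetical helper, and the period and statistic rules are assumptions based on CloudWatch's documented constraints that should be verified against the current API reference.

```python
import re

# Standard statistics accepted by CloudWatch; percentiles use the pNN form.
STANDARD_STATISTICS = {"Average", "Sum", "Minimum", "Maximum", "SampleCount"}
PERCENTILE_PATTERN = re.compile(r"^p(\d{1,2}(\.\d+)?|100(\.0+)?)$")


def validate_alarm_params(namespace, metric_name, period=300,
                          statistic="Average", evaluation_periods=3):
    """Return a list of problems found; an empty list means the inputs look valid."""
    problems = []
    if not namespace:
        problems.append("namespace must be non-empty")
    if not metric_name:
        problems.append("metric_name must be non-empty")
    # Assumption: periods are 10, 20, 30 (high-resolution metrics only)
    # or a positive multiple of 60 seconds.
    if period not in (10, 20, 30) and (period <= 0 or period % 60 != 0):
        problems.append("period must be 10, 20, 30, or a positive multiple of 60")
    if statistic not in STANDARD_STATISTICS and not PERCENTILE_PATTERN.match(statistic):
        problems.append(f"unsupported statistic: {statistic}")
    if evaluation_periods < 1:
        problems.append("evaluation_periods must be at least 1")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid field in a single pass.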
Essential Alarms
EC2 Alarms
yaml
- name: HighCPU
  metric: CPUUtilization
  threshold: 80
  period: 300
  evaluation_periods: 3
- name: StatusCheckFailed
  metric: StatusCheckFailed
  threshold: 1
  period: 60
  evaluation_periods: 2
ECS Alarms
yaml
- name: HighCPU
  metric: CPUUtilization
  threshold: 80
- name: HighMemory
  metric: MemoryUtilization
  threshold: 85
- name: RunningTaskCount
  metric: RunningTaskCount
  threshold: 1
  comparison: LessThan
RDS Alarms
yaml
- name: HighCPU
  metric: CPUUtilization
  threshold: 80
- name: LowFreeStorage
  metric: FreeStorageSpace
  threshold: 10737418240  # 10 GiB
  comparison: LessThan
- name: HighConnections
  metric: DatabaseConnections
  threshold: 100
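Alarm entries like the ones above still need to be translated into API arguments. This sketch shows one possible mapping, assuming the entries have already been loaded into Python dictionaries. The helper `to_alarm_kwargs`, the `prod` name prefix, the database identifier `my-db`, and the mapping from the short `comparison` values to CloudWatch operators are all illustrative assumptions.

```python
# Defaults taken from the Optional parameters table.
DEFAULTS = {"period": 300, "evaluation_periods": 3, "statistic": "Average"}

# Assumption: the short `comparison` values map onto these CloudWatch operators.
COMPARISON_OPERATORS = {
    "GreaterThan": "GreaterThanThreshold",
    "LessThan": "LessThanThreshold",
}


def to_alarm_kwargs(entry, namespace, dimensions, name_prefix="prod"):
    """Translate one alarm entry into keyword arguments for put_metric_alarm."""
    merged = {**DEFAULTS, **entry}
    comparison = merged.get("comparison", "GreaterThan")
    return {
        "AlarmName": f"{name_prefix}-{merged['name']}",
        "Namespace": namespace,
        "MetricName": merged["metric"],
        "Dimensions": dimensions,
        "Statistic": merged["statistic"],
        "Period": merged["period"],
        "EvaluationPeriods": merged["evaluation_periods"],
        "Threshold": float(merged["threshold"]),
        "ComparisonOperator": COMPARISON_OPERATORS[comparison],
    }


rds_low_storage = {
    "name": "LowFreeStorage",
    "metric": "FreeStorageSpace",
    "threshold": 10737418240,
    "comparison": "LessThan",
}
kwargs = to_alarm_kwargs(
    rds_low_storage,
    namespace="AWS/RDS",
    dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],  # hypothetical ID
)
```

The resulting dictionary can be passed to a boto3 CloudWatch client as `client.put_metric_alarm(**kwargs)`.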
Implementation
Create Alarm
bash
aws cloudwatch put-metric-alarm \
  --alarm-name prod-ec2-high-cpu \
  --alarm-description "EC2 CPU > 80% for 15 minutes" \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --ok-actions arn:aws:sns:us-east-1:123456789012:alerts \
  --treat-missing-data notBreaching
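The same alarm can be created from Python. This is a sketch of a rough boto3 equivalent of the CLI call, with the client passed in so the function can be exercised against a stub. The function names `alarm_window_minutes` and `create_high_cpu_alarm` are hypothetical; the keyword arguments follow boto3's `put_metric_alarm`.

```python
def alarm_window_minutes(period_seconds, evaluation_periods):
    """Minutes the metric must stay in breach before the alarm fires."""
    return period_seconds * evaluation_periods / 60


def create_high_cpu_alarm(client, instance_id, topic_arn,
                          threshold=80.0, period=300, evaluation_periods=3):
    """Create the high-CPU alarm through a CloudWatch client."""
    minutes = alarm_window_minutes(period, evaluation_periods)
    client.put_metric_alarm(
        AlarmName="prod-ec2-high-cpu",
        # Derive the description so it always matches period x evaluation periods.
        AlarmDescription=f"EC2 CPU > {threshold:g}% for {minutes:g} minutes",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        Statistic="Average",
        Period=period,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        EvaluationPeriods=evaluation_periods,
        AlarmActions=[topic_arn],
        OKActions=[topic_arn],
        TreatMissingData="notBreaching",
    )
```

With boto3 installed and credentials configured, a call would look like `create_high_cpu_alarm(boto3.client("cloudwatch"), "i-1234567890abcdef0", "arn:aws:sns:us-east-1:123456789012:alerts")`. Deriving the description from the period and evaluation count keeps the "15 minutes" text from drifting when either value changes.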
Dashboard Template
json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "EC2 CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "ECS Service Memory",
        "metrics": [
          ["AWS/ECS", "MemoryUtilization", "ClusterName", "my-cluster", "ServiceName", "my-service"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1"
      }
    }
  ]
}
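A dashboard body like the template can also be generated rather than written by hand. The sketch below builds the JSON string from small helpers; `metric_widget` and `dashboard_body` are hypothetical names, and the `my-cluster` value is a placeholder, included on the assumption that service-level ECS metrics are published with both ClusterName and ServiceName dimensions.

```python
import json


def metric_widget(title, metric, region="us-east-1", period=300, stat="Average"):
    """Build one metric widget; `metric` is a list such as
    ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]."""
    return {
        "type": "metric",
        "properties": {
            "title": title,
            "metrics": [metric],
            "period": period,
            "stat": stat,
            "region": region,
        },
    }


def dashboard_body(widgets):
    """Serialize widgets into the JSON string used as the dashboard body."""
    return json.dumps({"widgets": widgets})


body = dashboard_body([
    metric_widget("EC2 CPU Utilization",
                  ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]),
    metric_widget("ECS Service Memory",
                  ["AWS/ECS", "MemoryUtilization",
                   "ClusterName", "my-cluster",  # placeholder cluster name
                   "ServiceName", "my-service"]),
])
```

The string can then be sent with boto3's `put_dashboard(DashboardName=..., DashboardBody=body)`. Giving every widget the same default period, statistic, and region avoids widgets that silently omit one of them.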
Custom Metrics
python
import boto3

cloudwatch = boto3.client('cloudwatch')

# Publish custom metric
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'RequestLatency',
            'Dimensions': [
                {'Name': 'Service', 'Value': 'API'},
                {'Name': 'Environment', 'Value': 'prod'}
            ],
            'Value': 150.5,
            'Unit': 'Milliseconds'
        }
    ]
)
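Publishing one data point per API call becomes expensive at volume, so samples are often sent in batches. The following sketch groups `RequestLatency` samples into batches before calling `put_metric_data`. The helper names and the default `batch_size` of 100 are assumptions; check the current PutMetricData per-request limits before choosing a value.

```python
def latency_datum(value_ms, service, environment):
    """Build one RequestLatency data point with the dimensions used above."""
    return {
        "MetricName": "RequestLatency",
        "Dimensions": [
            {"Name": "Service", "Value": service},
            {"Name": "Environment", "Value": environment},
        ],
        "Value": value_ms,
        "Unit": "Milliseconds",
    }


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def publish_latencies(client, samples_ms, service="API", environment="prod",
                      batch_size=100):
    """Publish samples in batches and return the number of API calls made."""
    data = [latency_datum(v, service, environment) for v in samples_ms]
    calls = 0
    for batch in chunked(data, batch_size):
        client.put_metric_data(Namespace="MyApp", MetricData=batch)
        calls += 1
    return calls
```

Passing the client in as an argument keeps the batching logic testable without AWS credentials.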
Log Insights Queries
Error Rate
sql
fields @timestamp, @message
| filter @message like /ERROR/
| stats count() as error_count by bin(5m)
Latency Analysis
sql
fields @timestamp, latency
| stats avg(latency) as avg_latency,
        pct(latency, 95) as p95_latency,
        pct(latency, 99) as p99_latency
  by bin(1h)
Top Errors
sql
fields @timestamp, @message
| filter @message like /Exception|Error/
| parse @message /(?<error_type>\w+Exception)/
| stats count() as count by error_type
| sort count desc
| limit 10
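Queries like these can also be run programmatically. This sketch starts a query through a CloudWatch Logs client and polls until it finishes, using boto3's `start_query` and `get_query_results`. The function name `run_insights_query`, the polling interval, and the poll limit are illustrative assumptions.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and return its rows as dicts once complete."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),  # epoch seconds
        endTime=int(end),
        queryString=query,
    )["queryId"]
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete after {max_polls} polls")
```

With boto3, the client would be `boto3.client("logs")`. Values come back as strings, so numeric columns such as `error_count` need converting before arithmetic.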
Troubleshooting
Common Issues
| Symptom | Cause | Solution |
|---|---|---|
| No data | Metric not emitting | Check CloudWatch Agent |
| Alarm stuck in INSUFFICIENT_DATA | Metric missing or too sparse | Check the --treat-missing-data setting and the metric's dimensions |
| Dashboard empty | Wrong namespace | Verify metric source |
| High costs | Too many metrics | Use metric filters |
Debug Checklist
- CloudWatch Agent installed and running?
- IAM role allows cloudwatch:PutMetricData?
- Correct namespace and dimensions?
- Metric has data in expected period?
- Alarm threshold reasonable?
- SNS topic has subscriptions?
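Parts of this checklist can be automated. The sketch below scans a `describe_alarms` response for two of the conditions: alarms with no data and alarms that notify nobody. The helper `find_alarm_problems` is hypothetical and covers only what is visible in the alarm records, not agent status or IAM permissions.

```python
def find_alarm_problems(describe_alarms_response):
    """Return human-readable findings for alarms that look misconfigured."""
    problems = []
    for alarm in describe_alarms_response.get("MetricAlarms", []):
        name = alarm["AlarmName"]
        if alarm.get("StateValue") == "INSUFFICIENT_DATA":
            problems.append(
                f"{name}: no data; check namespace, dimensions, and the agent")
        if not alarm.get("AlarmActions"):
            problems.append(f"{name}: no alarm actions configured")
    return problems
```

With boto3, the input would come from `boto3.client("cloudwatch").describe_alarms()`. Results may be paginated, so a complete audit would need to follow the paginator.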
Test Template
python
import boto3

# Assumes AWS credentials and a default region are configured.
cw = boto3.client('cloudwatch')


def test_cloudwatch_alarm():
    # Arrange
    alarm_name = "test-alarm"
    try:
        # Act
        cw.put_metric_alarm(
            AlarmName=alarm_name,
            MetricName='CPUUtilization',
            Namespace='AWS/EC2',
            Statistic='Average',
            Period=300,
            EvaluationPeriods=1,
            Threshold=80,
            ComparisonOperator='GreaterThanThreshold'
        )
        # Assert
        response = cw.describe_alarms(AlarmNames=[alarm_name])
        assert len(response['MetricAlarms']) == 1
    finally:
        # Cleanup runs even if the assertion fails
        cw.delete_alarms(AlarmNames=[alarm_name])
Assets
- assets/alarm-config.yaml - Common alarm configurations