AgentSkillsCN

aws-observability-setup

CloudWatch、CloudTrail、Config、日志记录与告警的标准模式。

SKILL.md
--- frontmatter
name: aws-observability-setup
description: Standard patterns for CloudWatch, CloudTrail, Config, logging, and alerting
version: 1.0.0
category: aws
agents: [aws-coworker-core, aws-coworker-planner, aws-coworker-observability-cost]
tools: [Read, Bash]

AWS Observability Setup

Purpose

This skill provides standardized patterns for setting up AWS observability including CloudWatch metrics, logs, and alarms; CloudTrail for audit; AWS Config for compliance; and Security Hub for security posture.

When to Use

  • Setting up monitoring for new resources
  • Establishing baseline observability
  • Creating alerting strategies
  • Implementing compliance logging
  • Reviewing observability gaps

When NOT to Use

  • Application-level monitoring (APM tools)
  • Third-party monitoring integration
  • Custom metrics development (specific to apps)

Observability Stack

ComponentPurpose
CloudWatch MetricsPerformance and operational data
CloudWatch LogsCentralized log management
CloudWatch AlarmsAlerting and automated actions
CloudTrailAPI activity audit trail
AWS ConfigConfiguration compliance
VPC Flow LogsNetwork traffic analysis
Security HubSecurity posture aggregation

CloudWatch Metrics

Standard Metrics to Monitor

EC2

MetricAlarm ThresholdPeriod
CPUUtilization> 80%5 min
StatusCheckFailed> 01 min
NetworkIn/OutAnomaly detection5 min
EBSReadOps/WriteOpsBaseline + 2 std dev5 min

RDS

MetricAlarm ThresholdPeriod
CPUUtilization> 80%5 min
FreeStorageSpace< 20% of total5 min
DatabaseConnections> 80% of max5 min
ReadLatency/WriteLatency> baseline5 min
FreeableMemory< 10%5 min

Lambda

MetricAlarm ThresholdPeriod
Errors> 5% of invocations5 min
Duration> 80% of timeout5 min
Throttles> 05 min
ConcurrentExecutions> 80% of limit5 min

ALB/NLB

MetricAlarm ThresholdPeriod
HTTPCode_ELB_5XX_Count> 105 min
HTTPCode_Target_5XX_Count> 105 min
TargetResponseTime> 1 second5 min
UnHealthyHostCount> 01 min

Enabling Detailed Monitoring

bash
# Enable detailed monitoring for EC2
aws ec2 monitor-instances \
  --instance-ids i-xxxxxxxxx \
  --profile {profile} \
  --region {region}

# Enable enhanced monitoring for RDS
aws rds modify-db-instance \
  --db-instance-identifier {db-id} \
  --monitoring-interval 60 \
  --monitoring-role-arn arn:aws:iam::{account}:role/rds-monitoring-role \
  --profile {profile} \
  --region {region}

CloudWatch Alarms

Alarm Configuration Pattern

yaml
# CloudFormation example
Resources:
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-cpu-high"
      AlarmDescription: CPU utilization exceeds 80%
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref MyInstance
      AlarmActions:
        - !Ref AlertSNSTopic
      OKActions:
        - !Ref AlertSNSTopic

CLI Alarm Creation

bash
# Create CPU alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "prod-web-cpu-high" \
  --alarm-description "CPU utilization exceeds 80%" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --dimensions Name=InstanceId,Value=i-xxxxxxxxx \
  --alarm-actions arn:aws:sns:{region}:{account}:alerts \
  --profile {profile} \
  --region {region}

Alarm Naming Convention

code
{env}-{service}-{metric}-{condition}

Examples:
- prod-web-cpu-high
- dev-rds-storage-low
- staging-lambda-errors-high

CloudWatch Logs

Log Group Configuration

bash
# Create log group with retention
aws logs create-log-group \
  --log-group-name /aws/lambda/{function-name} \
  --profile {profile} \
  --region {region}

aws logs put-retention-policy \
  --log-group-name /aws/lambda/{function-name} \
  --retention-in-days 30 \
  --profile {profile} \
  --region {region}

Standard Log Groups

ServiceLog Group PatternRetention
Lambda/aws/lambda/{function}30 days
ECS/ecs/{cluster}/{service}30 days
API Gateway/aws/api-gateway/{api}30 days
VPC Flow Logs/vpc/flow-logs/{vpc}90 days
Application/app/{service}/{env}30-90 days

Log Insights Queries

sql
-- Error analysis
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100

-- Lambda cold starts
fields @timestamp, @message, @duration
| filter @type = "REPORT" and @initDuration > 0
| sort @timestamp desc

-- Request latency percentiles
stats avg(@duration), pct(@duration, 95), pct(@duration, 99) by bin(5m)

CloudTrail

Trail Configuration

bash
# Create organization trail
aws cloudtrail create-trail \
  --name org-audit-trail \
  --s3-bucket-name {audit-bucket} \
  --is-organization-trail \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --include-global-service-events \
  --profile {profile} \
  --region {region}

# Start logging
aws cloudtrail start-logging \
  --name org-audit-trail \
  --profile {profile} \
  --region {region}

CloudTrail Best Practices

markdown
## CloudTrail Configuration Checklist

- [ ] Multi-region trail enabled
- [ ] Log file validation enabled
- [ ] S3 bucket with encryption
- [ ] S3 bucket with access logging
- [ ] CloudWatch Logs integration
- [ ] Global service events included
- [ ] Data events for critical S3/Lambda (optional)

CloudTrail Event Analysis

bash
# Query recent events via CLI
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
  --start-time 2024-01-01T00:00:00Z \
  --profile {profile} \
  --region {region}

VPC Flow Logs

Enable Flow Logs

bash
# Create log group
aws logs create-log-group \
  --log-group-name /vpc/flow-logs/{vpc-id} \
  --profile {profile} \
  --region {region}

# Create flow log
aws ec2 create-flow-logs \
  --resource-ids vpc-xxxxxxxxx \
  --resource-type VPC \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flow-logs/{vpc-id} \
  --deliver-logs-permission-arn arn:aws:iam::{account}:role/flow-logs-role \
  --profile {profile} \
  --region {region}

Flow Log Analysis

sql
-- Rejected traffic
fields @timestamp, srcAddr, dstAddr, dstPort, action
| filter action = "REJECT"
| sort @timestamp desc
| limit 100

-- Top talkers by bytes
stats sum(bytes) as totalBytes by srcAddr
| sort totalBytes desc
| limit 20

AWS Config

Enable Config Recording

bash
# Create config recorder
aws configservice put-configuration-recorder \
  --configuration-recorder name=default,roleARN=arn:aws:iam::{account}:role/config-role \
  --recording-group allSupported=true,includeGlobalResourceTypes=true \
  --profile {profile} \
  --region {region}

# Create delivery channel
aws configservice put-delivery-channel \
  --delivery-channel name=default,s3BucketName={config-bucket} \
  --profile {profile} \
  --region {region}

# Start recording
aws configservice start-configuration-recorder \
  --configuration-recorder-name default \
  --profile {profile} \
  --region {region}

Essential Config Rules

RulePurpose
s3-bucket-public-read-prohibitedNo public S3 buckets
encrypted-volumesEBS encryption required
rds-storage-encryptedRDS encryption required
ec2-instance-managed-by-ssmSystems Manager coverage
vpc-flow-logs-enabledFlow logs required

Security Hub

Enable Security Hub

bash
# Enable Security Hub
aws securityhub enable-security-hub \
  --enable-default-standards \
  --profile {profile} \
  --region {region}

# Enable specific standards
aws securityhub batch-enable-standards \
  --standards-subscription-requests \
    StandardsArn=arn:aws:securityhub:::ruleset/cis-aws-foundations-benchmark/v/1.2.0 \
  --profile {profile} \
  --region {region}

Security Hub Findings

bash
# Get critical findings
aws securityhub get-findings \
  --filters '{"SeverityLabel":[{"Value":"CRITICAL","Comparison":"EQUALS"}]}' \
  --profile {profile} \
  --region {region}

Dashboard Template

Production Dashboard

json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "title": "EC2 CPU Utilization",
        "metrics": [
          ["AWS/EC2", "CPUUtilization", "InstanceId", "i-xxx"]
        ],
        "period": 300
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "ALB Request Count",
        "metrics": [
          ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/xxx"]
        ],
        "period": 60
      }
    },
    {
      "type": "metric",
      "properties": {
        "title": "RDS Connections",
        "metrics": [
          ["AWS/RDS", "DatabaseConnections", "DBInstanceIdentifier", "xxx"]
        ],
        "period": 60
      }
    }
  ]
}

Observability Checklist

markdown
## Full Observability Checklist

### CloudWatch
- [ ] Detailed monitoring enabled
- [ ] Custom metrics where needed
- [ ] Alarms for critical metrics
- [ ] Dashboard created
- [ ] Anomaly detection configured

### Logging
- [ ] CloudWatch Logs for all services
- [ ] Retention policies set
- [ ] Log Insights queries saved
- [ ] Subscription filters for alerts

### Audit
- [ ] CloudTrail enabled (multi-region)
- [ ] Log file validation on
- [ ] S3 bucket secure
- [ ] CloudWatch Logs integration

### Network
- [ ] VPC Flow Logs enabled
- [ ] Traffic analysis configured
- [ ] Rejected traffic monitored

### Security
- [ ] Security Hub enabled
- [ ] Findings reviewed regularly
- [ ] Standards enabled (CIS, etc.)

### Compliance
- [ ] AWS Config recording
- [ ] Config rules deployed
- [ ] Non-compliance alerts

Related Skills

  • aws-cli-playbook — CLI patterns for setup
  • aws-well-architected — Operational excellence pillar
  • aws-governance-guardrails — Compliance requirements
  • aws-cost-optimizer — Cost of observability