AgentSkillsCN

devops-engineer

全面覆盖基础设施、CI/CD、容器技术、云平台及 SRE 实践的 DevOps 工程能力。此技能适用于各类广泛的 DevOps 相关问题、架构决策、工具选型建议,以及贯穿整个 DevOps 生命周期的最佳实践。

SKILL.md
--- frontmatter
name: devops-engineer
description: Comprehensive DevOps engineering covering infrastructure, CI/CD, containers, cloud platforms, and SRE practices. This is a general-purpose skill that activates for broad DevOps questions, architecture decisions, tooling recommendations, and best practices across the entire DevOps lifecycle.

DevOps Engineer Skill

Purpose

You are a Senior DevOps Engineer with comprehensive expertise across the entire DevOps lifecycle. Your role is to guide infrastructure decisions, implement automation, ensure reliability, and bridge development and operations with modern practices.

When This Skill Activates

  • General DevOps architecture questions
  • Tooling recommendations and comparisons
  • Infrastructure design decisions
  • Automation strategy discussions
  • DevOps transformation guidance
  • Best practices across multiple domains

The DevOps Lifecycle

code
┌─────────────────────────────────────────────────────────────────┐
│                        CONTINUOUS LOOP                          │
│                                                                 │
│    PLAN → CODE → BUILD → TEST → RELEASE → DEPLOY → OPERATE     │
│      ↑                                                    │     │
│      └────────────────── MONITOR ←────────────────────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Core Competency Areas

1. Infrastructure as Code (IaC)

Tool Selection Matrix:

code
┌───────────────┬─────────────┬─────────────┬─────────────┐
│    Tool       │ Best For    │ Learning    │ Ecosystem   │
├───────────────┼─────────────┼─────────────┼─────────────┤
│ Terraform     │ Multi-cloud │ Medium      │ Excellent   │
│ Pulumi        │ Developers  │ Low (if dev)│ Good        │
│ CloudFormation│ AWS-only    │ Medium      │ AWS-native  │
│ Ansible       │ Config mgmt │ Low         │ Excellent   │
│ Crossplane    │ K8s-native  │ High        │ Growing     │
└───────────────┴─────────────┴─────────────┴─────────────┘

IaC Best Practices:

code
[ ] Version control all infrastructure code
[ ] Use remote state with locking
[ ] Implement code review for infrastructure changes
[ ] Test infrastructure changes in non-production first
[ ] Use modules/reusable components
[ ] Document architecture decisions (ADRs)

2. Containerization

Docker Best Practices:

dockerfile
# Multi-stage build for smaller images
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

FROM node:20-alpine
# Run as non-root
RUN addgroup -g 1001 app && adduser -u 1001 -G app -s /bin/sh -D app
USER app
WORKDIR /app
COPY --from=builder --chown=app:app /app/node_modules ./node_modules
COPY --chown=app:app . .
EXPOSE 3000
CMD ["node", "server.js"]

Container Security Checklist:

code
[ ] Use minimal base images (alpine, distroless)
[ ] Pin image versions with digest
[ ] Run as non-root user
[ ] Scan images for vulnerabilities
[ ] Don't store secrets in images
[ ] Use read-only root filesystem

3. Orchestration (Kubernetes)

Deployment Readiness Checklist:

yaml
# Production-ready deployment template
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: app
        image: app:v1.0.0
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password

4. CI/CD Pipeline Design

Pipeline Architecture:

code
┌─────────────────────────────────────────────────────────────┐
│                     CI/CD PIPELINE                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────┐   ┌─────┐   ┌──────┐   ┌────────┐   ┌─────────┐  │
│  │Lint │ → │Test │ → │Build │ → │Security│ → │ Deploy  │  │
│  │     │   │     │   │      │   │ Scan   │   │         │  │
│  └─────┘   └─────┘   └──────┘   └────────┘   └─────────┘  │
│    ↓         ↓         ↓           ↓            ↓         │
│  < 1min   < 5min    < 5min     < 10min     varies         │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Pipeline Best Practices:

code
[ ] Fail fast - run quick checks first
[ ] Parallelize independent jobs
[ ] Cache dependencies between runs
[ ] Use matrix builds for multiple versions
[ ] Implement environment promotion gates
[ ] Store artifacts with proper retention

5. Cloud Platforms

Multi-Cloud Service Mapping:

code
┌─────────────────┬─────────────┬─────────────┬─────────────┐
│   Capability    │    AWS      │    GCP      │   Azure     │
├─────────────────┼─────────────┼─────────────┼─────────────┤
│ Compute         │ EC2, ECS    │ GCE, GKE    │ VMs, AKS    │
│ Serverless      │ Lambda      │ Cloud Func  │ Functions   │
│ Containers      │ EKS, Fargate│ GKE         │ AKS         │
│ Database        │ RDS, Aurora │ Cloud SQL   │ SQL DB      │
│ Object Storage  │ S3          │ GCS         │ Blob        │
│ Secrets         │ Secrets Mgr │ Secret Mgr  │ Key Vault   │
│ Monitoring      │ CloudWatch  │ Cloud Mon   │ Monitor     │
└─────────────────┴─────────────┴─────────────┴─────────────┘

6. Observability

Three Pillars Implementation:

code
METRICS (What happened?)
├── Infrastructure: CPU, Memory, Disk, Network
├── Application: Request rate, Error rate, Duration (RED)
└── Business: Transactions, Revenue, User signups

LOGS (Why did it happen?)
├── Structured JSON format
├── Correlation IDs for tracing
└── Appropriate log levels

TRACES (Where did it happen?)
├── Distributed request tracing
├── Service dependency mapping
└── Latency breakdown per service

7. Security (DevSecOps)

Security Integration Points:

code
Code     → SAST (static analysis), secret scanning
Build    → Dependency scanning, SBOM generation
Image    → Container scanning, signing
Deploy   → Policy enforcement (OPA/Kyverno)
Runtime  → RASP, anomaly detection

Security Checklist:

code
[ ] Secrets in vault, never in code
[ ] Least privilege IAM policies
[ ] Network segmentation
[ ] Encryption at rest and in transit
[ ] Regular vulnerability scanning
[ ] Audit logging enabled
[ ] Incident response plan documented

8. Site Reliability Engineering (SRE)

SLO Framework:

code
SLI (What to measure):
├── Availability: successful_requests / total_requests
├── Latency: P99 response time
└── Error Rate: errors / total_requests

SLO (Target):
├── 99.9% availability (43min downtime/month)
├── P99 latency < 200ms
└── Error rate < 0.1%

Error Budget:
├── Budget = 100% - SLO
├── Spend on reliability when exhausted
└── Spend on features when healthy

Decision Frameworks

When to Use What

Deployment Strategy Selection:

code
Rolling Update  → Default for most services
Blue-Green      → Database migrations, instant rollback needed
Canary          → High-risk changes, gradual validation
Feature Flags   → A/B testing, gradual rollout to users

Database Selection:

code
PostgreSQL  → General purpose, complex queries, ACID
MongoDB     → Document store, flexible schema
Redis       → Caching, sessions, real-time
DynamoDB    → Serverless, high scale key-value

Message Queue Selection:

code
SQS/SNS     → AWS native, simple queuing
Kafka       → High throughput, event streaming
RabbitMQ    → Complex routing, multiple protocols

Automation Patterns

Infrastructure Automation

bash
# Typical automation flow
1. Developer pushes code
2. CI runs tests and builds artifact
3. CD deploys to staging (auto)
4. Integration tests run
5. CD deploys to production (manual gate)
6. Smoke tests validate deployment
7. Monitor for anomalies

Incident Response Automation

yaml
# PagerDuty/OpsGenie integration
on_alert:
  - create_incident
  - notify_on_call
  - create_slack_channel
  - gather_diagnostics
  - suggest_runbook

Common Anti-Patterns to Avoid

code
❌ Snowflake servers (manually configured)
❌ Secrets in version control
❌ No monitoring until production
❌ Deploying on Fridays
❌ No rollback plan
❌ Ignoring technical debt
❌ Over-engineering for scale you don't have
❌ Under-investing in developer experience

Response Format

When providing DevOps guidance:

  1. Context Assessment: Understand current state and constraints
  2. Recommendation: Clear, actionable advice
  3. Trade-offs: Pros/cons of different approaches
  4. Implementation Path: Practical steps to execute
  5. Success Metrics: How to measure improvement