AgentSkillsCN

operational-patterns

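Provides security architecture, observability, CI/CD pipeline, database migration, and environment strategy patterns for production-ready systems.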

SKILL.md
---
name: operational-patterns
description: Security architecture, observability, CI/CD pipelines, database migrations, and environment strategy patterns for production-ready systems
---

Operational Patterns

Patterns and recommendations for security, observability, CI/CD, database migrations, and environment management. Use when generating the Security Architecture, Observability, and DevOps Blueprint deliverables.


Security Architecture

Auth Strategy Selection

| Project Type | Recommended Auth | Rationale |
|---|---|---|
| Simple app, few roles | Clerk or Supabase Auth | Managed service, minimal setup, built-in UI components |
| Multi-tenant SaaS | Auth0 or Clerk (organizations) | Organization-level isolation, role management, SSO |
| API-only / B2B | API keys + JWT | Simple machine-to-machine auth |
| Enterprise / compliance-heavy | Auth0 or Keycloak (self-hosted) | Fine-grained control, audit logs, compliance certifications |
| Mobile app | Firebase Auth or Clerk | Native SDK support, social login, biometrics |

API Security Checklist

Every REST API service should implement these protections:

| Protection | Implementation | Priority |
|---|---|---|
| Rate limiting | express-rate-limit, Upstash Ratelimit, or API gateway rate limits | Must-have |
| Input validation | Zod schemas on all request bodies and query params. Reject unknown fields. | Must-have |
| CORS | Whitelist specific origins. Never use the * wildcard in production. | Must-have |
| Helmet headers | helmet middleware for security headers (CSP, X-Frame-Options, HSTS) | Must-have |
| SQL/NoSQL injection | Parameterized queries only. Never interpolate user input into queries. | Must-have |
| XSS prevention | Escape output by default and sanitize any user-supplied HTML before rendering it. Use frameworks with built-in escaping (React, Next.js). | Must-have |
| CSRF protection | SameSite cookies + CSRF tokens for cookie-based auth. Not needed for bearer-token-only APIs. | Conditional |
| Request size limits | Limit body size (e.g. 1 MB by default, higher for file uploads) | Should-have |
| Authentication | Verify the JWT/session on every protected route, in shared middleware rather than per-route code. | Must-have |
| Authorization | Check user roles/permissions after authentication, in separate middleware. | Must-have |
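Rate limiting is normally delegated to express-rate-limit, Upstash Ratelimit, or the API gateway, but the underlying idea is small. The sketch below is a dependency-free fixed-window limiter, not the API of any of those libraries; the class name `FixedWindowRateLimiter` is hypothetical, and because its state lives in one process's memory it only works for a single instance (multiple replicas need a shared store such as Redis).

```typescript
// Fixed-window rate limiter held in process memory (illustrative sketch only).
// Assumption: a single server instance; counts are not shared across replicas.
// Expired entries are never evicted here; a real limiter would clean them up.

interface WindowState {
  windowStart: number; // epoch ms at which the current window began
  count: number; // requests seen so far in the current window
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit; // max requests allowed per window, per key
    this.windowMs = windowMs; // window length in milliseconds
  }

  /** Returns true if a request for `key` (e.g. client IP or API key) is allowed at `now`. */
  allow(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);
    if (state === undefined || now - state.windowStart >= this.windowMs) {
      // First request for this key, or the previous window expired: start a new one.
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count < this.limit) {
      state.count += 1;
      return true;
    }
    return false; // over the limit: respond with 429 Too Many Requests
  }
}

// Usage: 3 requests per 1-second window for a single client.
const limiter = new FixedWindowRateLimiter(3, 1000);
const decisions = [0, 100, 200, 300].map((t) => limiter.allow("ip:203.0.113.7", t));
console.log(decisions.join(", ")); // true, true, true, false
console.log(limiter.allow("ip:203.0.113.7", 1200)); // true: a new window has started
```

Fixed windows allow short bursts at window boundaries (up to twice the limit across two adjacent windows); sliding-window or token-bucket algorithms smooth this out at the cost of slightly more state.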

Data Protection

| Concern | Recommendation |
|---|---|
| Encryption at rest | Use the database provider's built-in encryption. Supabase and MongoDB Atlas encrypt storage by default; on RDS, confirm storage encryption is enabled when the instance is created, since it cannot be turned on for an existing instance in place. |
| Encryption in transit | TLS 1.3 on all endpoints. Enforce HTTPS redirects. Internal service-to-service traffic can use HTTP if it stays within the VPC. |
| PII handling | Identify PII fields (email, name, phone, address, IP). Log them only when necessary. Mask them in non-prod environments. |
| Secrets management | Never hardcode secrets. Use platform env vars (Vercel, Railway) or a secrets manager (Doppler, AWS SSM). Rotate API keys periodically. |
| Data retention | Define retention periods per data type. Implement soft deletes for user data. Support data export/deletion for GDPR. |
| Backups | Automated daily database backups with point-in-time recovery. Test restores quarterly. |
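For the "mask in non-prod environments" rule, here is a minimal sketch of a masking pass applied before rows are copied into staging or dev. The field names, the `maskPii` helper, and the masking rules (keep an email's first character and domain, keep a phone number's last two digits) are illustrative assumptions, not a standard.

```typescript
// Sketch: mask PII columns before copying rows into non-production environments.
// Assumption: the PII field names below match your schema; adjust to your own inventory.

type Row = Record<string, unknown>;

const PII_FIELDS = new Set(["email", "name", "phone", "address", "ip"]);

// Keep the first character and the domain so masked data still looks realistic.
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  return at > 0 ? `${email[0]}***${email.slice(at)}` : "***";
}

// Replace every digit except the last two, preserving separators like "-" and "()".
function maskPhone(phone: string): string {
  const totalDigits = (phone.match(/\d/g) ?? []).length;
  let seen = 0;
  return phone.replace(/\d/g, (digit) => (++seen <= totalDigits - 2 ? "*" : digit));
}

function maskField(field: string, value: unknown): unknown {
  if (value === null || value === undefined) return value;
  const text = String(value);
  if (field === "email") return maskEmail(text);
  if (field === "phone") return maskPhone(text);
  return "***"; // name, address, ip: fully redacted
}

function maskPii(row: Row): Row {
  const masked: Row = {};
  for (const [field, value] of Object.entries(row)) {
    masked[field] = PII_FIELDS.has(field) ? maskField(field, value) : value;
  }
  return masked;
}

const sample = maskPii({ id: 42, email: "jane@example.com", phone: "555-0142", plan: "pro" });
console.log(sample.email, sample.phone, sample.plan); // j***@example.com ***-**42 pro
```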

OWASP Top 10 (2021) Quick Reference

When assessing security, check against these common threats:

  1. Broken Access Control: Ensure authorization checks on every endpoint, not just authentication
  2. Cryptographic Failures: Don't store passwords in plaintext (use bcrypt/argon2 via auth provider), don't log tokens
  3. Injection: Parameterized queries, input validation, no eval/exec of user input
  4. Insecure Design: Rate limit auth endpoints, implement account lockout, use CAPTCHA on public forms
  5. Security Misconfiguration: Remove default credentials, disable debug mode in production, review CORS
  6. Vulnerable Components: Keep dependencies updated, run npm audit / pip-audit in CI
  7. Authentication Failures: Use a managed auth provider, implement MFA for admin roles
  8. Data Integrity Failures: Verify webhook signatures, validate JWT signatures, use CSP headers
  9. Logging Failures: Log auth events, access denials, and input validation failures
  10. SSRF: Validate and whitelist URLs before making server-side requests
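Item 10 (SSRF) in sketch form: a hypothetical `isAllowedOutboundUrl` check built on the standard WHATWG `URL` parser, run before any server-side fetch of a user-supplied URL. The allowlisted hosts are placeholders. This check alone does not stop DNS rebinding or redirects to internal addresses, so pair it with network-level egress rules that block private and link-local ranges.

```typescript
// Sketch: allowlist check to run before any server-side request to a URL that
// came from user input (OWASP SSRF). Hosts below are placeholders.

const ALLOWED_HOSTS: ReadonlySet<string> = new Set(["api.stripe.com", "hooks.slack.com"]);

function isAllowedOutboundUrl(raw: string, allowed: ReadonlySet<string> = ALLOWED_HOSTS): boolean {
  let url: URL;
  try {
    url = new URL(raw); // WHATWG URL parser; throws on malformed or relative URLs
  } catch {
    return false;
  }
  if (url.protocol !== "https:") return false; // blocks http:, file:, gopher:, ...
  if (url.username !== "" || url.password !== "") return false; // no embedded credentials
  if (url.port !== "") return false; // parser reports "" for the default port 443
  return allowed.has(url.hostname); // hostname is already lowercased by the parser
}

console.log(isAllowedOutboundUrl("https://api.stripe.com/v1/charges")); // true
console.log(isAllowedOutboundUrl("https://api.stripe.com.evil.example/")); // false
console.log(isAllowedOutboundUrl("http://169.254.169.254/latest/meta-data/")); // false
```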

Observability

Observability Stack Recommendations

| Project Size | Logging | Tracing | Metrics | Alerting | Monthly Cost |
|---|---|---|---|---|---|
| MVP / startup | Axiom (free tier) or console + Vercel logs | Not needed yet | Vercel/Railway built-in | Sentry (free tier) | $0 |
| Growing (1K-10K users) | Axiom or Better Stack | Sentry performance | PostHog + Sentry | Sentry + Slack webhooks | $20-50/mo |
| Production (10K+ users) | Datadog or Grafana Cloud | OpenTelemetry → Jaeger/Datadog | Prometheus + Grafana or Datadog | PagerDuty + Datadog | $100-500/mo |
| Enterprise | Datadog or Splunk | Datadog APM or Honeycomb | Datadog or custom Prometheus | PagerDuty + Datadog | $500+/mo |

Structured Logging

All services should use structured JSON logging:

```json
{
  "level": "info",
  "timestamp": "2026-02-07T10:30:00.000Z",
  "service": "api-server",
  "requestId": "req_abc123",
  "userId": "usr_xyz",
  "action": "create_order",
  "duration_ms": 145,
  "message": "Order created successfully"
}
```

Logging rules:

  • Use log levels consistently: error (broken), warn (degraded), info (business events), debug (dev only)
  • Include requestId for request correlation across services
  • Include userId for audit trail (mask in logs if compliance requires)
  • Never log: passwords, tokens, full credit card numbers, API keys
  • Always log: auth failures, permission denials, input validation errors, external API errors
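A minimal sketch of a logger that follows these rules: one JSON object per line, fixed fields that callers cannot overwrite, and recursive redaction of sensitive keys. It is dependency-free for illustration; in practice a library such as pino or winston would usually fill this role. The `formatLog` name and the redacted-key list are assumptions.

```typescript
// Sketch of a structured JSON logger following the logging rules.
// Assumption: the redacted key names are examples; extend them for your own secrets.

type LogLevel = "error" | "warn" | "info" | "debug";

const REDACTED_KEYS = new Set(["password", "token", "accesstoken", "apikey", "authorization", "cardnumber"]);

// Recursively replace values of sensitive keys (matched case-insensitively).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = REDACTED_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}

function formatLog(
  level: LogLevel,
  service: string,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  const context = redact(fields) as Record<string, unknown>;
  // Fixed fields come last so caller-supplied fields cannot overwrite them.
  return JSON.stringify({ ...context, level, timestamp: now.toISOString(), service, message });
}

const line = formatLog(
  "info",
  "api-server",
  "Order created successfully",
  { requestId: "req_abc123", userId: "usr_xyz", action: "create_order", duration_ms: 145, token: "secret-token" },
  new Date("2026-02-07T10:30:00.000Z"),
);
console.log(line); // one JSON line; "token" is emitted as "[REDACTED]"
```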

Key Metrics to Track

| Category | Metric | Alert Threshold |
|---|---|---|
| Availability | Uptime percentage | < 99.5% over 24h |
| Latency | Request duration p50, p95, p99 | p99 > 2s |
| Error rate | 5xx errors / total requests | > 1% over 5 minutes |
| Throughput | Requests per second | Unusual spike or drop (>3x baseline) |
| Queue depth | Jobs waiting in queue | > 1000 for > 5 minutes |
| Database | Connection pool usage, query duration | Pool > 80%, queries > 500ms |
| AI/LLM | Token usage, response time, failure rate | Failure rate > 5%, response > 30s |
| Business | Signups, conversions, active users | Unusual drops (context-dependent) |
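To make two of these thresholds concrete, here is a sketch that evaluates one 5-minute window of request samples: p99 latency by the nearest-rank method and the 5xx error rate. It assumes samples are already in memory and uses a hypothetical `RequestSample` shape; in a real stack the metrics backend (Prometheus, Datadog, and so on) computes these.

```typescript
// Sketch: evaluate the latency and error-rate thresholds for one 5-minute window.
// Assumption: request samples are already collected in memory (hypothetical shape).

interface RequestSample {
  durationMs: number;
  status: number; // HTTP status code
}

// Nearest-rank percentile: smallest value such that at least p% of samples are <= it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

function evaluateAlerts(samples: RequestSample[]): string[] {
  const alerts: string[] = [];
  const p99 = percentile(samples.map((s) => s.durationMs), 99);
  if (p99 > 2000) alerts.push(`latency: p99 ${p99}ms exceeds 2000ms`);

  const serverErrors = samples.filter((s) => s.status >= 500).length;
  const errorRate = samples.length === 0 ? 0 : serverErrors / samples.length;
  if (errorRate > 0.01) alerts.push(`error rate: ${(errorRate * 100).toFixed(1)}% exceeds 1%`);
  return alerts;
}

// 197 fast successful requests plus 3 slow 503s: both thresholds fire.
const lastFiveMinutes: RequestSample[] = [
  ...Array.from({ length: 197 }, () => ({ durationMs: 120, status: 200 })),
  ...Array.from({ length: 3 }, () => ({ durationMs: 3000, status: 503 })),
];
console.log(evaluateAlerts(lastFiveMinutes));
```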

Health Check Pattern

Every service exposes tiered health endpoints. A detailed readiness response with per-dependency checks looks like this:

```json
{
  "status": "healthy",
  "service": "api-server",
  "version": "1.2.3",
  "uptime_seconds": 86400,
  "checks": {
    "database": { "status": "healthy", "latency_ms": 5 },
    "redis": { "status": "healthy", "latency_ms": 2 },
    "external_api": { "status": "degraded", "latency_ms": 1500, "note": "slow but responding" }
  }
}
```

  • /health: quick liveness check (returns 200 if the process is running; does not call dependencies)
  • /health/ready: readiness check (returns 200 only if all required dependencies are reachable; a slow but responding dependency, like external_api above, reports "degraded" without failing readiness)
  • Load balancers, container orchestrators, and monitoring tools poll these endpoints
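A sketch of the liveness/readiness split, assuming each dependency check is an async function you supply (for example a `SELECT 1` against the database or a Redis `PING`). The 2-second timeout, the `required` flag, and the `slowThresholdMs` cutoff for reporting `degraded` are illustrative choices, not part of any framework.

```typescript
// Sketch of liveness vs readiness. Assumptions: each dependency check is an async
// function you supply; timeout and thresholds are illustrative.

type CheckStatus = "healthy" | "degraded" | "unhealthy";

interface CheckResult {
  status: CheckStatus;
  latency_ms: number;
}

interface DependencyCheck {
  name: string;
  required: boolean; // only required dependencies can fail readiness
  slowThresholdMs?: number; // above this latency the check reports "degraded"
  run: () => Promise<void>; // resolves if the dependency is reachable
}

async function runCheck(check: DependencyCheck, timeoutMs: number): Promise<CheckResult> {
  const start = Date.now();
  let cancelTimeout = (): void => {};
  const timeout = new Promise<never>((_, reject) => {
    const handle = setTimeout(() => reject(new Error("timeout")), timeoutMs);
    cancelTimeout = () => clearTimeout(handle);
  });
  try {
    await Promise.race([check.run(), timeout]);
    const latency = Date.now() - start;
    const slow = check.slowThresholdMs !== undefined && latency > check.slowThresholdMs;
    return { status: slow ? "degraded" : "healthy", latency_ms: latency };
  } catch {
    return { status: "unhealthy", latency_ms: Date.now() - start };
  } finally {
    cancelTimeout(); // don't leave a pending timer behind
  }
}

// GET /health: liveness only. Never calls dependencies.
function liveness() {
  return { httpStatus: 200, body: { status: "healthy" } };
}

// GET /health/ready: 503 if any required dependency is unreachable or timed out.
async function readiness(checks: DependencyCheck[], timeoutMs = 2000) {
  const results = await Promise.all(checks.map((check) => runCheck(check, timeoutMs)));
  const report: Record<string, CheckResult> = {};
  let ready = true;
  for (let i = 0; i < checks.length; i++) {
    report[checks[i].name] = results[i];
    if (checks[i].required && results[i].status === "unhealthy") ready = false;
  }
  return { httpStatus: ready ? 200 : 503, body: { status: ready ? "healthy" : "unhealthy", checks: report } };
}
```

Keeping /health free of dependency calls matters: if liveness failed whenever the database was slow, an orchestrator could restart every replica at once during a database incident.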

CI/CD Pipeline

Pipeline Templates by Provider

GitHub Actions (recommended for most projects):

```
Stages: lint → test → build → deploy

Triggers:
  - Push to main → deploy to production
  - Push to develop → deploy to staging
  - Pull request → run lint + test only
  - Manual dispatch → deploy to any environment
```

Pipeline stages:

| Stage | What It Does | Tools |
|---|---|---|
| Lint | Code style, formatting, type checking | ESLint, Prettier, tsc --noEmit / Ruff, mypy |
| Test | Unit tests, integration tests | Jest, Vitest, pytest |
| Build | Compile, bundle, Docker image | tsc, next build, docker build |
| Security | Dependency audit, secret scanning | npm audit, pip-audit, Trivy, GitGuardian |
| Deploy | Push to hosting provider | Vercel CLI, Railway CLI, AWS CDK, docker push |
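The GitHub Actions trigger rules listed above can be expressed as a small pure function, which is easy to unit-test if a deploy script decides where to ship. The `CiEvent` shape is a simplified assumption rather than the GitHub Actions event payload, and the handling of pushes to other branches is also assumed, since the trigger list does not cover them.

```typescript
// Sketch: the pipeline trigger rules as a pure, testable function.
// Assumption: CiEvent is a simplified shape, not the GitHub Actions event payload.

type Environment = "development" | "staging" | "production";

type CiEvent =
  | { kind: "push"; branch: string }
  | { kind: "pull_request" }
  | { kind: "manual"; target: Environment };

interface PipelinePlan {
  stages: string[];
  deployTo?: Environment;
}

const FULL_PIPELINE = ["lint", "test", "build", "deploy"];

function planPipeline(event: CiEvent): PipelinePlan {
  if (event.kind === "pull_request") return { stages: ["lint", "test"] }; // PRs never deploy
  if (event.kind === "manual") return { stages: [...FULL_PIPELINE], deployTo: event.target };
  // Remaining case: a push. The branch decides the target environment.
  if (event.branch === "main") return { stages: [...FULL_PIPELINE], deployTo: "production" };
  if (event.branch === "develop") return { stages: [...FULL_PIPELINE], deployTo: "staging" };
  return { stages: ["lint", "test"] }; // assumption: other branches only run checks
}

console.log(planPipeline({ kind: "push", branch: "develop" }).deployTo); // staging
```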

Branch Strategy Selection

| Team Size | Recommended Strategy | Workflow |
|---|---|---|
| Solo / 1-2 devs | github-flow | main + feature branches. Merge via PR. Deploy on merge to main. |
| 3-5 devs | github-flow | Same, but require PR reviews. Use a staging environment for pre-prod testing. |
| 5-10 devs | gitflow or trunk-based | Gitflow if you need scheduled releases. Trunk-based if you deploy continuously. |
| 10+ devs | trunk-based with feature flags | Short-lived branches (<1 day). Feature flags for incomplete features. |

Environment Promotion

```
Feature branch → PR review → merge to develop → auto-deploy to staging →
manual promote to production (merge develop → main) → auto-deploy to production
```

For simpler projects:

```
Feature branch → PR review → merge to main → auto-deploy to production
```

Database Migrations

Migration Tool Selection

| Stack | Recommended Tool | Alternatives |
|---|---|---|
| Node.js + PostgreSQL | Prisma Migrate | Knex, TypeORM, Drizzle Kit |
| Node.js + MongoDB | Mongoose (schema-on-read) | migrate-mongo |
| Python + PostgreSQL | Alembic | Django migrations, SQLAlchemy-migrate |
| Python + MongoDB | No formal migrations needed | mongomock for testing |

Migration Strategy

| Concern | Recommendation |
|---|---|
| Versioning | Sequential numbered migrations (001_create_users.sql, 002_add_orders.sql). Never edit applied migrations. |
| Rollback | Every migration has an up and a down. Test rollbacks before deploying. |
| CI integration | Run pending migrations automatically in CI before tests. Run them in staging before production. |
| Zero-downtime | Avoid breaking changes in one step. Add column → backfill → make required → remove old. |
| Seed data | Dev seeds: faker/factory data for local development. Staging seeds: anonymized subset of production data. |
| Production | Run migrations before deploying new code. Use advisory locks to prevent concurrent migrations. |
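A sketch of how a runner can enforce "sequential numbering" and "never edit applied migrations": sort files by numeric prefix, compare each applied migration's stored checksum with the file on disk, and return what is still pending. The tracking-record shape is an assumption, and FNV-1a stands in for the cryptographic hash (such as SHA-256) a real runner would use; tools such as Prisma Migrate, Alembic, and Knex keep their own tracking tables.

```typescript
// Sketch: decide which migrations are pending and refuse to run if an applied one
// was edited. Assumptions: files are named NNN_description.sql and the runner stores
// { name, checksum } for each applied migration in a tracking table.

interface MigrationFile { name: string; sql: string; }
interface AppliedMigration { name: string; checksum: string; }

// FNV-1a (32-bit) over UTF-16 code units; a stand-in for SHA-256 in a real runner.
function checksum(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function sequenceNumber(name: string): number {
  const match = /^(\d+)_/.exec(name);
  if (match === null) throw new Error(`Migration "${name}" must start with a number, e.g. 001_create_users.sql`);
  return Number(match[1]);
}

function pendingMigrations(files: MigrationFile[], applied: AppliedMigration[]): MigrationFile[] {
  const sorted = [...files].sort((a, b) => sequenceNumber(a.name) - sequenceNumber(b.name));
  const byName = new Map(sorted.map((file) => [file.name, file] as const));
  for (const record of applied) {
    const file = byName.get(record.name);
    if (file === undefined) throw new Error(`Applied migration ${record.name} is missing from disk`);
    if (checksum(file.sql) !== record.checksum) {
      throw new Error(`Applied migration ${record.name} was edited; write a new migration instead`);
    }
  }
  const appliedNames = new Set(applied.map((record) => record.name));
  return sorted.filter((file) => !appliedNames.has(file.name));
}

const files: MigrationFile[] = [
  { name: "002_add_orders.sql", sql: "CREATE TABLE orders (id serial PRIMARY KEY);" },
  { name: "001_create_users.sql", sql: "CREATE TABLE users (id serial PRIMARY KEY);" },
];
const applied: AppliedMigration[] = [{ name: "001_create_users.sql", checksum: checksum(files[1].sql) }];
console.log(pendingMigrations(files, applied).map((file) => file.name).join(", ")); // 002_add_orders.sql
```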

Common Migration Patterns

| Pattern | When | Example |
|---|---|---|
| Add nullable column | Safe, no downtime | ALTER TABLE users ADD COLUMN phone TEXT; |
| Rename column | Requires a two-step migration | Step 1: Add the new column, backfill it, and deploy code that reads and writes it. Step 2: Drop the old column. |
| Add index | Can lock the table on large datasets | Use CREATE INDEX CONCURRENTLY on PostgreSQL (it cannot run inside a transaction block). |
| Change column type | Risky: may lose data | Create new column, migrate data, drop old column |
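The rename-column pattern, sketched as ordered phases that include the application deploys between the SQL steps. This assumes PostgreSQL, unquoted lowercase identifiers, and a column type supplied by the migration author; the `renameColumnPhases` helper is hypothetical, and the single-statement backfill would normally be batched on large tables to keep locks and transactions small.

```typescript
// Sketch: the two-step column rename broken into ordered phases, including the
// application deploys that must happen between SQL steps.
// Assumptions: PostgreSQL, unquoted lowercase identifiers, and a trusted `type`.

interface MigrationPhase {
  step: string;
  description: string;
  sql: string[]; // empty when the phase is an application deploy, not SQL
}

const SAFE_IDENTIFIER = /^[a-z_][a-z0-9_]*$/;

function renameColumnPhases(table: string, from: string, to: string, type: string): MigrationPhase[] {
  for (const id of [table, from, to]) {
    if (!SAFE_IDENTIFIER.test(id)) throw new Error(`Unsafe identifier: ${id}`);
  }
  return [
    { step: "1a", description: "Add the new column as nullable", sql: [`ALTER TABLE ${table} ADD COLUMN ${to} ${type};`] },
    { step: "1b", description: "Deploy code that writes both columns", sql: [] },
    { step: "1c", description: "Backfill existing rows", sql: [`UPDATE ${table} SET ${to} = ${from} WHERE ${to} IS NULL;`] },
    { step: "1d", description: "Deploy code that reads and writes only the new column", sql: [] },
    { step: "2", description: "Drop the old column", sql: [`ALTER TABLE ${table} DROP COLUMN ${from};`] },
  ];
}

for (const phase of renameColumnPhases("users", "phone", "phone_number", "TEXT")) {
  console.log(`${phase.step}: ${phase.description}`, phase.sql.join(" "));
}
```

Deploying the dual-write code before the backfill means rows created while the backfill runs are not missed, and the old column is only dropped once no running code reads it.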

Environment Strategy

Environment Definitions

| Environment | Purpose | Data | Access | Deploy Trigger |
|---|---|---|---|---|
| Local | Developer machine | Seed data / Docker Compose | Developer only | Manual |
| Development | Shared dev environment | Seed data | Dev team | Push to develop |
| Staging | Pre-production testing | Anonymized prod data or rich seeds | Dev team + QA | Push to staging or manual promote |
| Production | Live users | Real data | Restricted access | Push to main or manual promote |

Config Management

Environment variable categories:

| Category | Examples | Where Stored |
|---|---|---|
| Service config | PORT, NODE_ENV, LOG_LEVEL | .env file (local), platform env vars (deployed) |
| Database | DATABASE_URL, REDIS_URL | Platform env vars, secrets manager |
| Third-party API keys | STRIPE_SECRET_KEY, SENDGRID_API_KEY | Secrets manager (Doppler, AWS SSM) |
| Feature flags | ENABLE_AI_AGENT, ENABLE_BETA_FEATURES | Feature flag service or env vars |
| Internal service URLs | API_SERVER_URL, AGENT_SERVICE_URL | Platform env vars, service discovery |

Config validation:

  • Validate all environment variables on service startup using Zod, envalid, or pydantic-settings
  • Fail fast with clear error messages if required vars are missing
  • Log which environment the service is running in (but never log secret values)
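A dependency-free sketch of these rules, standing in for what Zod, envalid, or pydantic-settings would do: parse and validate every variable, collect all problems, fail fast with one clear error, and expose a log-safe summary that never includes secret values. The variable names come from the table above; which ones are required and the defaults (PORT 3000, LOG_LEVEL info) are assumptions.

```typescript
// Dependency-free sketch of startup config validation.
// Assumptions: which variables are required and the defaults (PORT 3000, LOG_LEVEL info).

interface Config {
  NODE_ENV: "development" | "staging" | "production" | "test";
  PORT: number;
  LOG_LEVEL: "error" | "warn" | "info" | "debug";
  DATABASE_URL: string;
  STRIPE_SECRET_KEY: string;
}

const SECRET_KEYS = ["DATABASE_URL", "STRIPE_SECRET_KEY"];

function loadConfig(env: Record<string, string | undefined>): Config {
  const errors: string[] = [];

  const required = (key: string): string => {
    const value = env[key];
    if (value === undefined || value === "") errors.push(`${key} is required`);
    return value ?? "";
  };

  const oneOf = <T extends string>(key: string, allowed: readonly T[], fallback: T): T => {
    const value = env[key] ?? fallback;
    if ((allowed as readonly string[]).includes(value)) return value as T;
    errors.push(`${key} must be one of: ${allowed.join(", ")}`);
    return fallback;
  };

  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push("PORT must be an integer between 1 and 65535");
  }

  const config: Config = {
    NODE_ENV: oneOf("NODE_ENV", ["development", "staging", "production", "test"] as const, "development"),
    PORT: port,
    LOG_LEVEL: oneOf("LOG_LEVEL", ["error", "warn", "info", "debug"] as const, "info"),
    DATABASE_URL: required("DATABASE_URL"),
    STRIPE_SECRET_KEY: required("STRIPE_SECRET_KEY"),
  };

  if (errors.length > 0) {
    // Fail fast, listing every problem at once. Messages name variables, never values.
    throw new Error(`Invalid configuration:\n  - ${errors.join("\n  - ")}`);
  }
  return config;
}

// Log-safe startup summary: environment and non-secret settings only.
function describeConfig(config: Config): string {
  return `env=${config.NODE_ENV} port=${config.PORT} logLevel=${config.LOG_LEVEL} ` +
    `secrets=${SECRET_KEYS.join(",")} (values hidden)`;
}

const demo = loadConfig({ NODE_ENV: "staging", PORT: "8080", DATABASE_URL: "postgres://db.internal/app", STRIPE_SECRET_KEY: "sk_test_123" });
console.log(describeConfig(demo)); // env=staging port=8080 logLevel=info secrets=DATABASE_URL,STRIPE_SECRET_KEY (values hidden)
```

In a Node service you would call `loadConfig(process.env)` once at startup, before opening any connections, and let the thrown error stop the process.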

Feature Flags

| Approach | When to Use | Tool |
|---|---|---|
| Environment variables | Simple on/off for 1-2 features | ENABLE_FEATURE_X=true |
| Config file | Multiple flags, no runtime changes needed | features.json loaded on startup |
| Feature flag service | Runtime toggling, gradual rollouts, A/B testing | LaunchDarkly ($10/mo), Unleash (open source), PostHog (free tier) |
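A sketch of the first and third approaches: reading an on/off flag from an environment variable, and a deterministic percentage rollout. The hashing scheme (FNV-1a over the flag key and user ID, reduced to 100 buckets) is an illustrative assumption, not how LaunchDarkly, Unleash, or PostHog compute rollouts.

```typescript
// Sketch: env-var flags plus a deterministic percentage rollout.
// Assumption: FNV-1a bucketing is illustrative, not any flag service's algorithm.

function envFlag(env: Record<string, string | undefined>, name: string): boolean {
  const raw = (env[name] ?? "").trim().toLowerCase();
  return raw === "true" || raw === "1";
}

// Map (flag, user) to a stable bucket in [0, 100) using 32-bit FNV-1a.
function rolloutBucket(flagKey: string, userId: string): number {
  const input = `${flagKey}:${userId}`;
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

// Enabled for roughly `percent`% of users; the same user always gets the same answer.
function isEnabledFor(flagKey: string, userId: string, percent: number): boolean {
  return rolloutBucket(flagKey, userId) < percent;
}

const env = { ENABLE_AI_AGENT: "true", ENABLE_BETA_FEATURES: "0" };
console.log(envFlag(env, "ENABLE_AI_AGENT"), envFlag(env, "ENABLE_BETA_FEATURES")); // true false
console.log(isEnabledFor("new-checkout", "usr_xyz", 25));
```

Hashing the flag key together with the user ID gives each flag an independent user split, and because a user's bucket never changes, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it.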

Choosing What to Include

Not every project needs all operational patterns. Use this guide:

| Project Stage | Include | Skip |
|---|---|---|
| MVP / proof of concept | Basic auth, console logging, simple CI (lint + test + deploy), env vars | Tracing, alerting, feature flags, multi-environment |
| Early startup (pre-product-market fit) | Managed auth, structured logging, Sentry, GitHub Actions CI/CD, staging env | APM, custom metrics, PagerDuty, complex migration strategy |
| Growing product (1K+ users) | All security checklist items, observability stack, full CI/CD pipeline, migration tooling, staging + production | Enterprise compliance, self-hosted tooling |
| Production / enterprise | Everything above + compliance audits, APM, distributed tracing, PagerDuty, feature flags, multi-region | Nothing: you need it all |

When generating blueprints, match the depth to the project's stage and complexity. Don't overwhelm an MVP with enterprise patterns.