SKILL.md
--- frontmatter
name: world-expert-standards
description: Advanced, high-scale architectural and operational patterns used by world-class engineering teams at companies like Google, Netflix, and Uber.

World-Expert Engineering Standards

If world-class experts built this project, they would move from "functionally stable" to "operationally indestructible." Here are the advanced patterns they would implement.

1. Governance & Communication

Contract-First Development (Schema Registry)

  • Standard: We used JSON over Dapr.
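  • Expert: Use Protobuf or Avro with a schema registry such as Confluent Schema Registry. This guards against breaking changes: the registry rejects a new schema version that is incompatible with earlier ones, and the producer's serializer refuses to send a message that does not match the registered schema. (Kafka brokers do not validate payloads by default.) The result is a strict, versioned contract between services.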
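The contract idea can be sketched in plain Python. This is a toy illustration, not the Confluent client API: `SchemaRegistry`, `SchemaError`, and `serialize` are hypothetical names, schemas are simplified to field-to-type mappings, and the compatibility rule (existing fields may not be removed or retyped) is a deliberate simplification of real Avro/Protobuf rules.

```python
import json


class SchemaError(ValueError):
    """Raised when a payload or a schema change violates the contract."""


class SchemaRegistry:
    """Toy in-memory registry; a stand-in for a real schema registry service."""

    def __init__(self):
        self._versions = {}  # subject -> list of {field_name: python_type}

    def register(self, subject, schema):
        versions = self._versions.setdefault(subject, [])
        if versions:
            latest = versions[-1]
            # Simplified rule (assumption): existing fields may not be
            # removed or change type; adding new fields is allowed.
            for field, ftype in latest.items():
                if schema.get(field) is not ftype:
                    raise SchemaError(f"breaking change on field {field!r}")
        versions.append(schema)
        return len(versions)  # version number, starting at 1

    def latest(self, subject):
        return self._versions[subject][-1]


def serialize(registry, subject, payload):
    """Validate the payload against the latest schema before producing it."""
    schema = registry.latest(subject)
    for field, ftype in schema.items():
        if not isinstance(payload.get(field), ftype):
            raise SchemaError(f"field {field!r} missing or not {ftype.__name__}")
    return json.dumps(payload).encode("utf-8")
```

The point of the sketch is where the failure happens: a bad payload or an incompatible schema is rejected on the producer side, before any consumer sees it.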

Advanced Service Mesh

  • Standard: Dapr for sidecar abstraction.
  • Expert: Layer Dapr with Istio or Linkerd. This allows for Canary Deployments (sending 5% of traffic to a new version) and Traffic Shadowing (copying production traffic to a test service without affecting users).
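As a rough illustration of what a 5% canary split means, the sketch below assigns each request to a version from a hash of its ID. In a real mesh the split is configured as weighted routes in the proxy layer rather than written in application code; `pick_version` and the request ID format are assumptions made for this example.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int = 5) -> str:
    """Route a stable slice of requests to the canary, based on a hash of the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Hashing keeps the decision deterministic, so the same request ID always lands on the same version, which makes canary behavior easier to debug than a purely random split.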

2. Robust Reliability (Chaos Engineering)

Chaos Mesh / Litmus

  • Standard: Fixed resources and stable pods.
  • Expert: Intentionally kill Kafka brokers, delay Dapr sidecars, and inject network latency using chaos engineering tools. If the system doesn't recover on its own under these faults, it isn't expert-level yet.
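The shape of a chaos experiment (inject a fault, then assert the system still reaches its steady state) can be sketched without any cluster. `FlakyBroker` and `publish_with_retry` are hypothetical names for this example; real experiments would be declared in Chaos Mesh or Litmus, and a production retry would add backoff.

```python
class FlakyBroker:
    """Hypothetical dependency that fails its first `fail_first` calls."""

    def __init__(self, fail_first):
        self.remaining_failures = fail_first
        self.calls = 0

    def publish(self, message):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("injected fault: broker unavailable")
        return f"acked:{message}"


def publish_with_retry(broker, message, attempts=3):
    """Retry on connection errors; re-raise the last error if all attempts fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return broker.publish(message)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

With two injected failures and three attempts the publish succeeds; with more failures than attempts the error surfaces, which is the signal that the recovery path is not yet sufficient.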

Testcontainers for Integration

  • Standard: Testing against a live Minikube.
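  • Expert: Use Testcontainers in CI to spin up real Kafka and Postgres containers for every PR, pinned to the same image versions that run in production. This verifies the code against the infrastructure versions it will actually meet, rather than whatever happens to be installed on a shared cluster.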

3. Infrastructure as Code (GitOps)

GitOps (ArgoCD / Flux)

  • Standard: Manual helm upgrade.
  • Expert: A GitOps controller in the cluster (ArgoCD or Flux) should watch a Git repository. Any change to the Helm chart in Git is automatically reconciled into the cluster. No human should ever run helm upgrade or kubectl apply manually in production.
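Reconciliation boils down to diffing the desired state in Git against the live state and acting on the difference. The function below is a minimal sketch of that comparison, not how ArgoCD or Flux is implemented; `reconcile` and the dictionary layout are assumptions for illustration.

```python
def reconcile(desired: dict, live: dict) -> list:
    """Return the actions needed to make live state match the state in Git."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            # Pruning: anything not declared in Git is removed.
            actions.append(("delete", name))
    return actions
```

A controller runs this kind of comparison in a loop, so a manual change made directly on the cluster shows up as drift and is reverted on the next pass.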

Policy as Code (Kyverno / OPA)

  • Standard: Manual checks for security.
  • Expert: Use an admission controller such as OPA Gatekeeper or Kyverno to reject any pod that lacks resource limits or runs as root. The cluster refuses bad configurations automatically, at admission time.
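The two rules named above can be expressed as a small check over a pod manifest. Real policies would be written in Rego (OPA) or as Kyverno YAML; this Python version only sketches the rule logic, and `violations` is a hypothetical helper. It assumes the manifest is already parsed into a dictionary.

```python
def violations(pod: dict) -> list:
    """List the policy violations for a pod manifest; empty means admit."""
    problems = []
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext", {})
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                problems.append(f"{name}: missing {resource} limit")
        # A container-level securityContext overrides the pod-level one.
        ctx = container.get("securityContext", {})
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if not non_root:
            problems.append(f"{name}: must set runAsNonRoot: true")
    return problems
```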

4. Deep Observability (Beyond Logging)

Semantic Tracing

  • Standard: Dapr default tracing.
  • Expert: Implement end-to-end OpenTelemetry tracing. Every business event (e.g., "Reminder Created") should carry a trace_id that links the Frontend UI action, the Backend API, the Kafka Message, and the Worker Service. You should be able to see exactly where a specific message picked up a 2-second delay.
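Linking the hops comes down to passing one trace ID along while each hop gets its own span ID. The sketch below builds W3C `traceparent` values by hand to show the mechanics; an actual service would let the OpenTelemetry SDK's propagators do this, and the helper names here are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, 32-hex trace ID, 16-hex span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the parent's trace ID but mint a new span ID for the next hop."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    return traceparent.split("-")[1]


# One "Reminder Created" event crossing four hops.
ui = new_traceparent()                      # Frontend UI action
api = child_traceparent(ui)                 # Backend API request header
kafka_headers = {"traceparent": child_traceparent(api)}  # Kafka message header
worker = child_traceparent(kafka_headers["traceparent"])  # Worker Service span
```

Because every hop shares the same trace ID, a tracing backend can line up the spans and show which hop accounts for a delay.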

SLIs/SLOs (Service Level Indicators and Objectives)

  • Standard: "Is the pod running?"
  • Expert: Define SLOs like "99.9% of reminders must be processed within 500ms." When slow reminders start consuming the error budget (the 0.1% that is allowed to miss the target) faster than planned, the system alerts the team before users notice a problem.
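The arithmetic behind that SLO is short enough to show directly. This is a minimal sketch assuming latencies are collected as a list of milliseconds per reminder; `sli` and `error_budget_remaining` are hypothetical names, and real setups would compute the same ratios from metrics in the monitoring system.

```python
def sli(latencies_ms, threshold_ms=500.0):
    """Fraction of reminders processed within the latency threshold."""
    if not latencies_ms:
        return 1.0
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(latencies_ms, slo=0.999, threshold_ms=500.0):
    """Share of the error budget still unspent; negative means overspent."""
    allowed_bad = (1 - slo) * len(latencies_ms)
    bad = sum(1 for latency in latencies_ms if latency > threshold_ms)
    if allowed_bad == 0:
        return 1.0 if bad == 0 else float("-inf")
    return 1 - bad / allowed_bad
```

For 10,000 reminders, a 99.9% target allows 10 slow ones. Five slow reminders leave about half the budget, and an alert would typically fire on how fast that budget is being spent rather than on a single slow request.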

5. Security (Zero Trust)

Dynamic Secret Rotation

  • Standard: Kubernetes Secrets (static).
  • Expert: Use HashiCorp Vault with Kubernetes Auth to issue dynamic secrets. Credentials such as DB passwords should live for a short lease (for example, 1 hour) and be rotated automatically. If a pod is compromised, the stolen password becomes useless as soon as its lease expires.
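The lease behavior can be sketched as a small credential cache that asks for a fresh password shortly before the current one expires. This does not call Vault: `issue` is a hypothetical callable standing in for whatever issues the credential, and the 1-hour TTL and 5-minute renewal margin are illustrative values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    password: str
    expires_at: float  # seconds on the caller's clock


class RotatingCredential:
    """Hands out a password and replaces it before the lease runs out."""

    def __init__(self, issue, ttl_seconds=3600.0, renew_margin=300.0):
        self._issue = issue  # hypothetical: returns a freshly issued password
        self._ttl = ttl_seconds
        self._renew_margin = renew_margin
        self._lease = None

    def get(self, now):
        expiring = (
            self._lease is None
            or now >= self._lease.expires_at - self._renew_margin
        )
        if expiring:
            self._lease = Lease(self._issue(), now + self._ttl)
        return self._lease.password
```

Passing `now` in explicitly keeps the sketch deterministic; a service would read the clock itself and let the secrets backend revoke the old credential.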

Formal Verification

  • Standard: Unit tests.
  • Expert: Use TLA+ or similar tools to formally verify the logic of the "Recurring Tasks" algorithm, ensuring there are no hidden race conditions in the distributed state.
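To give a feel for what a model checker does, the sketch below exhaustively explores every interleaving of two workers that each check whether the next occurrence exists and then create it. It is a toy in Python, not TLA+, and the two-step worker model is an assumption about how a "Recurring Tasks" race might look rather than a description of the actual algorithm.

```python
from itertools import permutations


def interleavings():
    """All orderings of two workers' two steps, keeping each worker's own order."""
    return sorted(set(permutations("AABB")))


def run(schedule):
    """Execute one schedule and return how many occurrences were created."""
    created = 0
    saw_missing = {}  # worker -> what its check step observed
    for worker in schedule:
        if worker not in saw_missing:
            # Step 1: check whether the next occurrence already exists.
            saw_missing[worker] = created == 0
        elif saw_missing[worker]:
            # Step 2: act on the (possibly stale) result of the check.
            created += 1
    return created


def violating_schedules():
    """Schedules that break the invariant 'at most one occurrence is created'."""
    return ["".join(s) for s in interleavings() if run(s) > 1]
```

Of the six possible schedules, the four in which both checks happen before either create produce a duplicate. Tools like TLA+ perform this kind of exhaustive search over far larger state spaces, which is how they surface races that unit tests rarely hit.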