
SKILL.md
---
name: event-sourcing-patterns
description: Use when designing event stores, implementing CQRS read/write separation, building projections, or coordinating distributed sagas. Covers technology selection, consistency handling, and workflow orchestration patterns.
---

# Event Sourcing & CQRS Patterns

## Event Store Technology Selection

| Technology | Best For | Avoid If |
|---|---|---|
| EventStoreDB | Pure event sourcing, projections built-in | Need a multi-purpose DB |
| PostgreSQL | Existing Postgres stack, SQL expertise | Extreme write throughput (>10K/s) |
| Kafka | High-throughput streaming, event bus | Per-aggregate queries are critical |
| DynamoDB | Serverless, AWS-native, auto-scaling | Complex cross-stream queries |

## Event Store Schema Design (Postgres)

```sql
CREATE TABLE events (
    stream_id VARCHAR(255) NOT NULL,  -- Pattern: "{Type}-{UUID}"
    stream_type VARCHAR(255) NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    event_data JSONB NOT NULL,
    version BIGINT NOT NULL,
    global_position BIGSERIAL,        -- Critical for projections
    created_at TIMESTAMPTZ DEFAULT NOW(),
    CONSTRAINT unique_stream_version UNIQUE (stream_id, version)
);

CREATE INDEX idx_events_stream ON events(stream_id, version);
CREATE INDEX idx_events_global ON events(global_position);  -- Projection catch-up
CREATE INDEX idx_events_type ON events(event_type);         -- Type-based subscriptions
```

**Key Decisions:**

- `stream_id` format: prefer `Order-{uuid}` over a bare `{uuid}` (enables type-based queries)
- `global_position`: a serial rather than a timestamp, since a serial prevents race conditions
- `version`: per-stream rather than global, since per-stream versions enable optimistic concurrency

## Event Store Guardrails

- **Immutability**: Never `UPDATE` or `DELETE` events; add compensating events instead
- **Optimistic concurrency**: Always check `expected_version` on append to prevent lost updates
- **Event size**: Keep events under 10KB; reference large payloads via a URL or S3 key
- **Idempotency**: Use `event_id` for deduplication; append must be idempotent
- **Correlation/causation IDs**: Required for tracing (`metadata.correlation_id`)
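The append guardrails above (optimistic concurrency plus `event_id` deduplication) can be sketched in memory. This is a minimal illustration, not tied to any particular store; the class and field names are hypothetical:

```python
import uuid
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    event_type: str
    data: dict


class ConcurrencyError(Exception):
    """Raised when expected_version does not match the stream head."""


class InMemoryEventStore:
    def __init__(self):
        self._streams: dict[str, list[Event]] = {}
        self._seen_ids: set[str] = set()

    def append(self, stream_id: str, events: list[Event],
               expected_version: int) -> int:
        stream = self._streams.setdefault(stream_id, [])
        # Optimistic concurrency: reject if another writer advanced the stream.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}")
        # Idempotency: silently skip events that were already stored.
        for e in events:
            if e.event_id not in self._seen_ids:
                self._seen_ids.add(e.event_id)
                stream.append(e)
        return len(stream)  # new stream version
```

On `ConcurrencyError`, a command handler would typically re-read the stream, re-run its validation against the fresh state, and retry the append.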

## CQRS: Consistency Models

| Model | When | Implementation |
|---|---|---|
| Eventual | Default; acceptable lag | Async projections, no write-time coupling |
| Read-your-writes | User expects immediate visibility | Poll the projection until its version ≥ the write version (5s timeout) |
| Inline projection | Strong consistency required | Update the read model in the same transaction as the event append |

**Gotcha:** Inline projections couple the write and read stores, which breaks their ability to scale independently. Use them only for single-DB deployments.
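The read-your-writes row above amounts to polling the read model until it catches up. A hedged sketch, where the `get_version` callable and the default timings are assumptions:

```python
import time


def wait_for_projection(get_version, min_version: int,
                        timeout: float = 5.0, interval: float = 0.05) -> bool:
    """Poll a read model until it has caught up to min_version.

    get_version: callable returning the projection's current version.
    Returns True once the projection has caught up, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_version() >= min_version:
            return True
        time.sleep(interval)
    return get_version() >= min_version  # one final check at the deadline
```

After a successful append returns version `v`, the API handler calls `wait_for_projection(..., v)` before rendering the response; on `False` it can fall back to showing the write-side data or a "still processing" state.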

## Projection Design Patterns

### Idempotency

Projections must be idempotent, because events are replayed during catch-up and rebuilds. Techniques:

- Upsert with the full state (not incremental updates)
- Track `last_processed_version` per entity
- Use `ON CONFLICT DO UPDATE` (Postgres) or conditional expressions (DynamoDB)
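A minimal projector combining the techniques above: keep a full summary row per entity and skip any event at or below the entity's last applied version, so replays are no-ops. The names and event types are illustrative:

```python
class OrderSummaryProjection:
    """Read model of order totals. Idempotent, so full replays are safe."""

    def __init__(self):
        self.rows: dict[str, dict] = {}          # stream_id -> summary row
        self.last_version: dict[str, int] = {}   # stream_id -> last applied version

    def apply(self, stream_id: str, version: int, event_type: str, data: dict):
        # Replayed or duplicate events are skipped, not re-applied.
        if version <= self.last_version.get(stream_id, 0):
            return
        row = self.rows.setdefault(stream_id, {"total": 0, "items": 0})
        if event_type == "ItemAdded":
            row["total"] += data["price"]
            row["items"] += 1
        elif event_type == "ItemRemoved":
            row["total"] -= data["price"]
            row["items"] -= 1
        self.last_version[stream_id] = version
```

In a SQL-backed projection, the version check and the row write would be a single `INSERT ... ON CONFLICT DO UPDATE ... WHERE` statement so the guard and the upsert are atomic.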

### Checkpointing

Store `last_processed_global_position` per projection:

- Enables resuming after a restart
- Supports independent projection versioning
- Checkpoint frequency: every event is overkill; batch every 100-1000 events
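Batched checkpointing might look like the following sketch; the handler and `save_checkpoint` callables and the batch size are assumptions:

```python
class CheckpointedSubscriber:
    """Processes a global event feed, persisting its position every `batch` events."""

    def __init__(self, handler, save_checkpoint, start_position: int = 0,
                 batch: int = 100):
        self._handler = handler        # projection update function
        self._save = save_checkpoint   # e.g. an upsert into a checkpoints table
        self.position = start_position
        self._batch = batch
        self._since_flush = 0

    def handle(self, global_position: int, event) -> None:
        self._handler(event)
        self.position = global_position
        self._since_flush += 1
        if self._since_flush >= self._batch:
            self.flush()

    def flush(self) -> None:
        self._save(self.position)
        self._since_flush = 0
```

On restart the subscriber resumes from the last saved position. Because checkpoints lag behind processing, events after the checkpoint will replay, which is exactly why the projections they feed must be idempotent.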

### Rebuild Strategy

Projections must support a full rebuild:

1. Create a new projection table (v2)
2. Replay all events into v2
3. Atomic swap: rename v1 to old, v2 to current
4. Drop the old table after validation

Never rebuild in place; a failure mid-rebuild risks data loss.
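The four steps above can be simulated with a small registry that maps a logical projection name to its current physical table (dicts stand in for tables here; all names are illustrative):

```python
import itertools


class ProjectionRegistry:
    """Maps a logical projection name to its current physical table."""

    def __init__(self):
        self.tables: dict[str, dict] = {}   # physical name -> table
        self.current: dict[str, str] = {}   # logical name -> physical name
        self._version = itertools.count(1)

    def rebuild(self, name: str, events, apply):
        # 1. Create a fresh table next to the live one; live reads are untouched.
        physical = f"{name}_v{next(self._version)}"
        table: dict = {}
        # 2. Replay the full event history into it.
        for event in events:
            apply(table, event)
        self.tables[physical] = table
        # 3. Atomic swap: a single reference assignment, so readers see either
        #    the old table or the new one, never a half-built mix.
        old = self.current.get(name)
        self.current[name] = physical
        # 4. The old table stays around for validation; drop it once satisfied.
        return old
```

In Postgres, step 3 would be a pair of `ALTER TABLE ... RENAME` statements inside one transaction.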

### Projection Types

| Type | Use Case | Tradeoff |
|---|---|---|
| Summary view | Order totals, counts | Must handle out-of-order events |
| Search index | Elasticsearch, Algolia | External dependency, harder rebuild |
| Aggregates | Daily sales rollups | Time-based bucketing complexity |
| Denormalized join | Customer + Orders in one doc | Higher storage, faster queries |
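Summary views in the table above must tolerate out-of-order delivery. One approach is a gate that buffers early arrivals and releases events strictly in version order; a sketch with hypothetical names:

```python
import heapq
import itertools


class InOrderApplier:
    """Buffers out-of-order events per stream, applying them in version order."""

    def __init__(self, apply):
        self._apply = apply                  # apply(stream_id, event)
        self._next: dict[str, int] = {}      # stream -> next expected version
        self._pending: dict[str, list] = {}  # stream -> heap of buffered events
        self._tie = itertools.count()        # tiebreaker so the heap never compares events

    def offer(self, stream_id: str, version: int, event) -> None:
        expected = self._next.setdefault(stream_id, 1)
        if version < expected:
            return  # duplicate of an already-applied event
        heap = self._pending.setdefault(stream_id, [])
        heapq.heappush(heap, (version, next(self._tie), event))
        # Drain while the buffered head is at or below the next expected version.
        while heap and heap[0][0] <= self._next[stream_id]:
            v, _, ev = heapq.heappop(heap)
            if v == self._next[stream_id]:   # skip stale buffered duplicates
                self._apply(stream_id, ev)
                self._next[stream_id] += 1
```

The tradeoff is unbounded buffering if a version never arrives; production gates usually add a timeout or fall back to re-reading the stream.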

## Saga & Workflow Orchestration

### Choreography vs Orchestration

| Factor | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose (services react to events) | Tighter (orchestrator knows the steps) |
| Visibility | Hard to trace end-to-end | Easy (orchestrator holds the state) |
| Complexity ceiling | Breaks down at 4+ steps | Scales to 10+ steps |
| Best for | 2-3 steps, decoupled teams | Order-dependent steps, complex compensation |

Default to orchestration unless the saga has fewer than four steps and simple compensation.

### Saga vs Workflow Engine

| Feature | Plain Saga | Workflow Engine (Temporal) |
|---|---|---|
| Retries | Manual | Built-in with backoff |
| State persistence | Manual saga store | Automatic |
| Determinism | Not enforced | Enforced (replay-safe) |
| Versioning | Manual migration | `workflow.get_version()` |
| Pick when | Simple compensating transactions | Long-running, stateful, multi-service flows |

### Compensation Design (Critical)

- **LIFO order**: Compensate in reverse execution order (a stack, not a queue)
- **Idempotency**: Compensations get retried; design them to tolerate multiple executions
- **Always succeed**: Compensations cannot be allowed to fail; if they can, add retries and alerting
- **Register before execution**: Push each step's compensation onto the stack before running the step
- **Partial compensation**: Track completed steps; only compensate those

**Gotcha:** A saga that completes its compensations has "failed successfully"; distinguish this from unrecoverable failures in monitoring.
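The rules above fit in a small orchestrator sketch: register the compensation before each step, and on failure run the registered compensations in LIFO order. All class and step names are illustrative:

```python
class SagaFailed(Exception):
    """Saga aborted; compensations for its registered steps have run."""


class Saga:
    def __init__(self):
        self._steps = []   # (action, compensation) pairs

    def step(self, action, compensation):
        self._steps.append((action, compensation))
        return self

    def run(self, ctx: dict) -> dict:
        registered = []  # compensation stack
        for action, compensation in self._steps:
            # Register *before* executing: if we crash between the side effect
            # and the bookkeeping, a retry still compensates this step. This is
            # why compensations must be idempotent and safe to run for a step
            # that never completed.
            registered.append(compensation)
            try:
                action(ctx)
            except Exception as exc:
                for comp in reversed(registered):   # LIFO: undo in reverse order
                    comp(ctx)                       # must not raise
                raise SagaFailed(str(exc)) from exc
        return ctx
```

Note that the failing step's own compensation also runs, which is the safe default when a step may have partially applied its side effect before raising.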

### Workflow Engine Constraints (Temporal-style)

**Workflow code (deterministic):**

- Prohibited: `datetime.now()`, `random()`, threading, I/O, network calls
- Use instead: `workflow.now()`, `workflow.random()`, and activities for side effects

**Activity code (non-deterministic):**

- Must be idempotent
- Must have a timeout (activities can hang)
- Classify errors: retryable (network) vs. non-retryable (validation)
- Use heartbeats for long-running (>30s) activities
- 2MB payload limit per argument

**Gotcha:** Workflow code runs repeatedly during replay; any non-determinism causes divergence errors.
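The divergence problem can be shown without a workflow engine: run the same "workflow" twice and compare results. Recording side-effect results (roughly what an engine's event history does for activities) keeps replay deterministic; calling `random()` directly does not. This is purely illustrative and not the Temporal API:

```python
import random


def flaky_workflow():
    # Non-deterministic: a fresh random draw on every (re)play.
    return [random.random() for _ in range(3)]


class Recorder:
    """Stand-in for an engine's event history: the first run records each
    activity result; replays return the recorded values instead of re-executing."""

    def __init__(self):
        self.history: list = []
        self._cursor = 0

    def activity(self, fn):
        if self._cursor < len(self.history):   # replaying: use recorded result
            result = self.history[self._cursor]
        else:                                  # first execution: run and record
            result = fn()
            self.history.append(result)
        self._cursor += 1
        return result

    def replay(self):
        self._cursor = 0


def deterministic_workflow(rec: Recorder):
    # Side effects go through the recorder, so every replay sees identical values.
    return [rec.activity(random.random) for _ in range(3)]
```

A real engine compares each replayed command against the recorded history; the moment `flaky_workflow`-style code produces a different value, the histories disagree and the engine raises a non-determinism error.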

## Operational Guardrails

- **Correlation IDs**: Propagate them through every step for tracing
- **Timeouts on every step**: Never wait indefinitely (5-minute default)
- **Monitor**: workflow duration, step failure rate, compensation trigger rate, stuck-workflow count
- **Versioning**: Never modify running workflow logic; use version gates or new workflow types

## Non-Obvious Gotchas

- **Choreography needs a saga ID**: Even without an orchestrator, you need correlation across events
- **Eventual consistency SLAs**: Define the acceptable lag (500ms? 5s?) and monitor the breach rate
- **Event versioning from day one**: Add `event.schema_version`; upcasting is harder than prevention
- **Don't query in command handlers**: Commands are for writes; querying breaks the CQRS separation
- **Business logic in workflows, not activities**: Activities are I/O adapters; decisions belong in the workflow
- **Projection lag snowball**: If a projection falls behind, continued writes accelerate the lag; it needs backpressure or scaling

## Do's and Don'ts

### Event Store

- **Do**: Use stream IDs with a type prefix (`Order-{uuid}`)
- **Do**: Include correlation/causation IDs in metadata
- **Do**: Implement optimistic concurrency on append
- **Don't**: Update or delete events
- **Don't**: Store large payloads (>10KB)

### CQRS

- **Do**: Denormalize read models for their query patterns
- **Do**: Validate in command handlers before changing state
- **Do**: Define consistency SLAs per feature
- **Don't**: Query in command handlers
- **Don't**: Couple read and write schemas

### Projections

- **Do**: Make projections idempotent
- **Do**: Store checkpoints for resume
- **Do**: Support a full rebuild
- **Don't**: Couple projections (each is independent)
- **Don't**: Ignore projection lag monitoring

### Sagas

- **Do**: Test compensations more thoroughly than the happy path
- **Do**: Use orchestration for more than three steps
- **Do**: Set timeouts on every step
- **Don't**: Skip correlation IDs
- **Don't**: Modify running workflow logic in place