AgentSkillsCN

pipeline-design

使用恰当模式设计ETL/ELT管道架构,以实现可靠性和可扩展性。

SKILL.md
--- frontmatter
name: pipeline-design
description: Design ETL/ELT pipeline architectures with proper patterns for reliability and scalability.

Pipeline Design

ETL vs ELT

ApproachWhen to Use
ETLTransform before load, limited warehouse compute
ELTModern warehouses (Snowflake, BigQuery, Redshift)

Pipeline Patterns

Batch

code
Source → Extract → Stage → Transform → Load → Target
         │                    │
         └── Checkpoint ──────┘
  • Scheduled intervals (hourly, daily)
  • Full or incremental loads
  • Idempotent operations

Streaming

code
Source → Kafka/Kinesis → Process → Sink
              │
              └── State Store
  • Real-time requirements
  • Event-driven architecture
  • Exactly-once semantics

Design Principles

  1. Idempotent - Safe to re-run
  2. Incremental - Process only new/changed data
  3. Observable - Metrics, logs, alerts
  4. Testable - Unit tests for transformations
  5. Recoverable - Checkpoints, retry logic

Staging Pattern

sql
-- 1. Land raw data
COPY INTO raw.source_data FROM @stage;

-- 2. Deduplicate
CREATE TABLE staging.deduped AS
SELECT * FROM raw.source_data
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) = 1;

-- 3. Transform to target
MERGE INTO target.dim_customer
USING staging.deduped
ON target.id = staging.id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;

Error Handling

  • Dead letter queues for failed records
  • Retry with exponential backoff
  • Alert on threshold breaches
  • Quarantine bad data for review