Pipeline Design

Name: pipeline-design
Rating: 87
Author: timequity

ETL vs ELT

Approach	When to Use
ETL	Transform before loading; use when warehouse compute is limited
ELT	Load raw data first, then transform inside the warehouse; use with modern warehouses (Snowflake, BigQuery, Redshift)

Pipeline Patterns

Batch

code

Source → Extract → Stage → Transform → Load → Target
         │                    │
         └── Checkpoint ──────┘

•Scheduled intervals (hourly, daily)
•Full or incremental loads
•Idempotent operations
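
A minimal sketch of an incremental, idempotent batch run, assuming an in-memory source and target. The names run_incremental_batch, updated_at, and last_updated_at are hypothetical and only illustrate the checkpoint idea from the diagram above.

```python
from typing import Dict, List


def run_incremental_batch(
    source_rows: List[dict],
    target: Dict[int, dict],
    checkpoint: Dict[str, int],
) -> int:
    """Load rows newer than the stored watermark and upsert them by id."""
    watermark = checkpoint.get("last_updated_at", 0)

    # Extract: only rows changed since the last successful run.
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]

    # Load: upsert keyed on id, so reprocessing a row overwrites it
    # instead of creating a duplicate.
    for row in new_rows:
        target[row["id"]] = row

    # Advance the checkpoint only after the load has succeeded.
    if new_rows:
        checkpoint["last_updated_at"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)


source = [
    {"id": 1, "name": "Ada", "updated_at": 100},
    {"id": 2, "name": "Bob", "updated_at": 105},
]
target: Dict[int, dict] = {}
checkpoint: Dict[str, int] = {}

first_run = run_incremental_batch(source, target, checkpoint)   # loads 2 rows
second_run = run_incremental_batch(source, target, checkpoint)  # loads 0 rows
```

The strict `>` comparison assumes timestamps never repeat across runs. A real pipeline would likely need an overlap window or a tie-breaking column.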

Streaming

code

Source → Kafka/Kinesis → Process → Sink
              │
              └── State Store

•Real-time requirements
•Event-driven architecture
•Exactly-once semantics (or at-least-once delivery with idempotent sinks)
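
One common way to approximate exactly-once results is at-least-once delivery plus deduplication against a state store. The sketch below assumes each event carries a unique event_id and uses an in-memory set as a stand-in for the state store. All names are hypothetical.

```python
from typing import Dict, Iterable, Set


def process_stream(
    events: Iterable[dict],
    seen_ids: Set[str],
    totals: Dict[str, int],
) -> None:
    """Apply each event at most once, even if the broker redelivers it."""
    for event in events:
        if event["event_id"] in seen_ids:
            continue  # duplicate delivery, already applied
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
        seen_ids.add(event["event_id"])


seen_ids: Set[str] = set()       # stand-in for a durable state store
totals: Dict[str, int] = {}      # stand-in for the sink
events = [
    {"event_id": "e1", "user": "ada", "amount": 10},
    {"event_id": "e2", "user": "bob", "amount": 5},
    {"event_id": "e1", "user": "ada", "amount": 10},  # redelivered duplicate
]
process_stream(events, seen_ids, totals)
```

In a real system the sink update and the seen-id write would need to be committed atomically. This sketch does not attempt that.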

Design Principles

•Idempotent - Safe to re-run
•Incremental - Process only new/changed data
•Observable - Metrics, logs, alerts
•Testable - Unit tests for transformations
•Recoverable - Checkpoints, retry logic
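
As an illustration of the "Testable" principle, a transformation can be written as a pure function and covered by unit tests. The function normalize_customer and its fields are hypothetical examples, not part of any particular pipeline.

```python
import unittest


def normalize_customer(record: dict) -> dict:
    """Pure transformation: no I/O, same input always gives same output."""
    return {
        "id": int(record["id"]),
        "email": record["email"].strip().lower(),
        "name": " ".join(record["name"].split()),  # collapse extra whitespace
    }


class NormalizeCustomerTest(unittest.TestCase):
    def test_cleans_fields(self):
        raw = {"id": "42", "email": "  Ada@Example.COM ", "name": "Ada   Lovelace"}
        self.assertEqual(
            normalize_customer(raw),
            {"id": 42, "email": "ada@example.com", "name": "Ada Lovelace"},
        )

    def test_is_idempotent(self):
        raw = {"id": "42", "email": "Ada@Example.com", "name": "Ada Lovelace"}
        once = normalize_customer(raw)
        # Applying the transformation twice changes nothing further.
        self.assertEqual(normalize_customer(once), once)
```

Saved as a module, the tests can be run with `python -m unittest`.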

Staging Pattern

sql

-- 1. Land raw data
COPY INTO raw.source_data FROM @stage;

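-- 2. Deduplicate: keep the most recently loaded row per id
-- (OR REPLACE keeps this step safe to re-run)
CREATE OR REPLACE TABLE staging.deduped AS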
SELECT * FROM raw.source_data
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) = 1;

-- 3. Upsert into the target (aliases avoid clashing with the schema names)
MERGE INTO target.dim_customer AS t
USING staging.deduped AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE ...
WHEN NOT MATCHED THEN INSERT ...;
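
For readers less familiar with QUALIFY, the deduplication step above can be sketched in plain Python. This is an illustration of the same "latest row per id" logic under assumed in-memory rows, not a replacement for the SQL. The helper dedupe_latest is hypothetical.

```python
from typing import Dict, Iterable, List


def dedupe_latest(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per id: the one with the greatest _loaded_at."""
    latest: Dict[object, dict] = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["_loaded_at"] > current["_loaded_at"]:
            latest[row["id"]] = row
    return list(latest.values())


raw_rows = [
    {"id": 1, "name": "Ada", "_loaded_at": 1},
    {"id": 2, "name": "Bob", "_loaded_at": 1},
    {"id": 1, "name": "Ada L.", "_loaded_at": 2},  # newer version of id 1
]
deduped = dedupe_latest(raw_rows)
```

When two rows share the same id and _loaded_at, this sketch keeps the first one seen, whereas ROW_NUMBER() picks among ties arbitrarily.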

Error Handling

•Dead letter queues for failed records
•Retry with exponential backoff
•Alert on threshold breaches
•Quarantine bad data for review
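
A minimal sketch of retry with exponential backoff plus a dead letter queue, assuming a per-record handler function. The names process_with_retry, max_attempts, and base_delay are hypothetical, and a plain list stands in for a real queue.

```python
import time
from typing import Callable, Iterable, List, Tuple


def process_with_retry(
    records: Iterable[dict],
    handler: Callable[[dict], object],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[object], List[Tuple[dict, str]]]:
    """Run handler on each record; failed records go to the dead letter list."""
    processed: List[object] = []
    dead_letters: List[Tuple[dict, str]] = []
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of retries: quarantine the record with its error.
                    dead_letters.append((record, repr(exc)))
                else:
                    # Delay doubles on each attempt: base, 2*base, 4*base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return processed, dead_letters


def parse_amount(record: dict) -> int:
    return int(record["amount"])


delays: List[float] = []  # record delays instead of actually sleeping
ok, dlq = process_with_retry(
    [{"amount": "10"}, {"amount": "oops"}],
    parse_amount,
    sleep=delays.append,
)
```

Here the bad record is retried twice and then quarantined, while the good record passes through. A production version would probably retry only transient errors and alert when the dead letter count crosses a threshold.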