Data Pipeline Architect

This skill provides guidance for designing robust, scalable data pipelines that move data reliably from sources to destinations.

Core Competencies

•ETL vs ELT: Traditional Extract-Transform-Load vs modern Extract-Load-Transform patterns
•Orchestration: Airflow, Dagster, Prefect, dbt for workflow management
•Data Quality: Validation, monitoring, lineage tracking
•Scalability: Batch vs streaming, partitioning, parallelization

Pipeline Design Process

1. Requirements Analysis

To begin pipeline design, gather:

•Source systems and data formats (APIs, databases, files, streams)
•Target destinations (data warehouse, lake, lakehouse)
•Freshness requirements (real-time, hourly, daily)
•Data volume and velocity estimates
•Quality and compliance requirements

2. Architecture Selection

Batch Pipelines - For periodic bulk processing:

•Schedule-driven (hourly, daily, weekly)
•Higher latency tolerance
•Simpler error recovery (re-run entire batch)
•Tools: Airflow, dbt, Spark

Streaming Pipelines - For real-time requirements:

•Event-driven processing
•Sub-second to minute latency
•Complex state management
•Tools: Kafka, Flink, Spark Streaming

Hybrid Approaches - Lambda or Kappa architecture:

•Batch layer for completeness
•Speed layer for low latency
•Serving layer for queries

3. ETL vs ELT Decision

ETL (Transform before Load):

•When target has limited compute
•When transformation reduces data volume significantly
•When sensitive data must be masked before landing
•Legacy data warehouse patterns

ELT (Transform after Load):

•Modern cloud warehouses with cheap compute
•When raw data preservation is needed
•When transformations change frequently
•dbt-style transformations in warehouse

4. Pipeline Components

Extraction Layer:

•Full extraction vs incremental (CDC, timestamp-based)
•API pagination and rate limiting
•Connection pooling and retry logic
•Schema detection and drift handling

Transformation Layer:

•Data cleansing and standardization
•Business logic application
•Aggregation and denormalization
•Type casting and null handling

Loading Layer:

•Upsert strategies (merge, delete+insert)
•Partitioning schemes (time, hash, range)
•Index management
•Transaction boundaries

5. Error Handling Patterns

code

┌─────────────────────────────────────────────────────────┐
│                    Pipeline Execution                    │
├─────────────────────────────────────────────────────────┤
│  ┌─────────┐    ┌───────────┐    ┌──────────┐          │
│  │ Extract │───▶│ Transform │───▶│   Load   │          │
│  └────┬────┘    └─────┬─────┘    └────┬─────┘          │
│       │               │               │                 │
│       ▼               ▼               ▼                 │
│  ┌─────────┐    ┌───────────┐    ┌──────────┐          │
│  │  Retry  │    │ Dead Letter│    │ Rollback │          │
│  │ w/Backoff│   │   Queue   │    │ Checkpoint│          │
│  └─────────┘    └───────────┘    └──────────┘          │
└─────────────────────────────────────────────────────────┘

•Retry with backoff: Transient failures (network, rate limits)
•Dead letter queues: Poison messages that can't be processed
•Checkpointing: Resume from last successful point
•Idempotency: Safe to re-run without duplicates

6. Data Quality Framework

Implement checks at each stage:

Stage	Check Type	Example
Extract	Completeness	Row count matches source
Extract	Freshness	Data timestamp within SLA
Transform	Validity	Values in expected ranges
Transform	Uniqueness	Primary keys unique
Load	Reconciliation	Target matches source totals
Load	Integrity	Foreign keys valid

7. Monitoring and Observability

Essential metrics to track:

•Pipeline duration and trends
•Row counts at each stage
•Error rates and types
•Data freshness (time since last successful run)
•Resource utilization

Alert on:

•SLA breaches (data not fresh)
•Anomalous row counts (±20% from baseline)
•Schema changes in sources
•Repeated failures

Common Patterns

Slowly Changing Dimensions (SCD)

•Type 1: Overwrite (no history)
•Type 2: Add row with validity dates
•Type 3: Previous value column
•Type 4: History table

Incremental Processing

sql

-- Timestamp-based incremental
SELECT * FROM source
WHERE updated_at > {{ last_run_timestamp }}

-- CDC-based (Change Data Capture)
-- Captures inserts, updates, deletes from transaction log

Idempotent Loads

sql

-- Delete + Insert pattern
DELETE FROM target WHERE date_partition = '2024-01-15';
INSERT INTO target SELECT * FROM staging WHERE date_partition = '2024-01-15';

-- Merge/Upsert pattern
MERGE INTO target t
USING staging s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...

References

•references/orchestration-patterns.md - Airflow, Dagster, Prefect patterns
•references/data-quality-checks.md - Validation frameworks and rules
•references/pipeline-templates.md - Common pipeline architectures