Pipeline Orchestration
Workflow orchestration tools for data pipelines: Prefect, Dagster, and dbt, with FlowerPower as a lightweight alternative. Between them, these tools handle scheduling, dependency resolution, retries, monitoring, and state management for production data pipelines.
Quick Comparison
| Tool | Paradigm | Best For | Learning Curve |
|---|---|---|---|
| Prefect | Flow-based | Pythonic workflows, quick prototypes, cloud-first | Moderate |
| Dagster | Asset-based | Data asset lineage, reproducibility, type checking | Steeper |
| dbt | SQL transformations | Analytics engineering, ELT, data warehouses | Low (SQL-focused) |
| FlowerPower | Hamilton DAGs | Lightweight batch ETL, configuration-driven pipelines | Low-Moderate |
When to Use Which?
- •Prefect: You want Python code flexibility, Prefect Cloud UI, and quick setup. Good for general-purpose data pipelines, ETL, and API integrations.
- •Dagster: You care about data asset observability, type safety, and reproducibility. Good for complex data platforms with clear asset dependencies.
- •dbt: Your transformations are primarily SQL and you're building analytics marts in a data warehouse. Great for analytics engineering teams.
Skill Dependencies
Assumes familiarity with:
- •@data-engineering-core - Polars, DuckDB, PyArrow
- •@data-engineering-storage-remote-access - Cloud storage for intermediate data
Related:
- •@data-engineering-quality - Data validation integrated into orchestration
- •@data-engineering-observability - Monitoring and tracing
- •@data-engineering-storage-lakehouse - Delta/Iceberg for state management
Detailed Guides
Prefect
See: @data-engineering-orchestration/prefect.md
- •Flows and tasks with decorators
- •Retries, caching, and parameters
- •Prefect Cloud (serverless) vs Prefect Server (self-hosted)
- •Deployment patterns
Dagster
See: @data-engineering-orchestration/dagster.md
- •Asset-based programming model
- •Materialization and partitions
- •Type checking with Dagster types
- •Sensors and schedules
- •Integration with data platforms
dbt (Data Build Tool)
See: @data-engineering-orchestration/dbt.md
- •Projects, models, tests, snapshots, seeds
- •Jinja templating and macros
- •Data testing (schema, cardinality, custom)
- •Documentation generation
- •Package management (dbt packages)
- •Adapters (DuckDB, Postgres, Snowflake, BigQuery, Spark)
FlowerPower (Lightweight Alternative)
FlowerPower is a lightweight DAG orchestration framework built on Apache Hamilton, ideal for batch ETL and data transformation scripts without the overhead of full orchestrators.
Key characteristics:
- •Hamilton-based: Define pipelines as Python functions; DAG auto-constructed
- •Configuration-driven: YAML files for parameters and execution settings
- •Lightweight: No database, no scheduler, no state persistence (batch-only)
- •Multiple executors: synchronous, threadpool, processpool, ray, dask
- •I/O plugins: Delta Lake, DuckDB, Polars, Pandas, S3, PostgreSQL, and more
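To make "DAG auto-constructed" concrete, here is a minimal standard-library sketch of the underlying idea. It is not Hamilton's or FlowerPower's actual API: each function is treated as a node, and its parameter names identify the upstream nodes. The function names (`raw_orders`, `cleaned_orders`, `order_total`), the sample rows, and the `run` helper are all hypothetical.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical pipeline: each function is a node; parameter names are its inputs.
def raw_orders() -> list[dict]:
    return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}, {"id": 3, "amount": 7.5}]

def cleaned_orders(raw_orders: list[dict]) -> list[dict]:
    # Drop rows with non-positive amounts.
    return [row for row in raw_orders if row["amount"] > 0]

def order_total(cleaned_orders: list[dict]) -> float:
    return sum(row["amount"] for row in cleaned_orders)

def run(functions, target):
    """Resolve dependencies from parameter names and execute in topological order."""
    nodes = {fn.__name__: fn for fn in functions}
    # Map each node to the set of nodes it depends on (its parameter names).
    graph = {name: set(inspect.signature(fn).parameters) for name, fn in nodes.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn = nodes[name]
        kwargs = {p: results[p] for p in inspect.signature(fn).parameters}
        results[name] = fn(**kwargs)
    return results[target]

# The order in which functions are listed does not matter.
total = run([order_total, cleaned_orders, raw_orders], target="order_total")
```

Because the graph comes from the signatures, adding a step means writing one more function; no separate wiring code is needed.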
When to choose FlowerPower over Prefect/Dagster:
- •Simple batch pipelines (daily/hourly ETL)
- •Quick prototyping that can grow
- •Teams that prefer code-first (Python functions) over YAML/UI
- •No need for sophisticated scheduling, SLA tracking, or long-running state
When NOT to use:
- •Production 24/7 workflows requiring reliability guarantees
- •Complex dependency graphs with cross-dependencies
- •Need for built-in retry policies with circuit breakers
- •Workflows requiring checkpoints and state recovery
- •Multi-team orchestration with fine-grained permissions
FlowerPower limitations vs. Prefect/Dagster:
| Feature | Prefect/Dagster | FlowerPower |
|---|---|---|
| Scheduling | Native (cron, intervals) | External (cron/systemd) |
| State persistence | Database/cloud | None (ephemeral) |
| Retry policies | Configurable per task | Per-pipeline via YAML |
| Observability | Rich UI, lineage | Basic Hamilton UI |
| Production readiness | High | Moderate (batch jobs) |
Integration with data-engineering stack:
- •Uses Polars/DuckDB for DataFrame operations (@data-engineering-core)
- •Delta Lake for ACID table formats (@data-engineering-storage-lakehouse)
- •fsspec/S3 for cloud storage (@data-engineering-storage-remote-access)
- •Pandera for data validation (@data-engineering-quality)
- •Follows medallion architecture (@data-engineering-best-practices)
Skill reference: @flowerpower - Complete guide to FlowerPower with advanced production patterns (watermarks, data quality, incremental loads, cloud deployment).
Cloud Storage Integration
See: @data-engineering-orchestration/integrations/cloud-storage.md
- •dbt + S3/GCS via HTTPFS (DuckDB), aws_s3 extension (Postgres)
- •Configuration patterns for profiles.yml
- •Credential management best practices
Common Patterns
Retry Pattern (All Orchestrators)
# Prefect: @task(retries=3, retry_delay_seconds=60)
# Dagster: @asset(retry_policy=RetryPolicy(...))
# dbt: no per-model retry setting; --fail-fast stops the run at the first failure, then re-run failed nodes with `dbt retry` or custom retry logic
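What a fixed-delay retry policy does can be sketched in plain Python. This illustrates the behaviour only and is not how Prefect or Dagster implement it; `with_retries`, `flaky_extract`, and the injectable `sleep` parameter are hypothetical names, and the sketch assumes `retries` counts additional attempts after the first.

```python
import time
from functools import wraps

def with_retries(retries=3, retry_delay_seconds=60, sleep=time.sleep):
    """Re-run the wrapped function up to `retries` extra times after a failure."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # retries exhausted: surface the last error
                    sleep(retry_delay_seconds)
        return wrapper
    return decorator

calls = {"count": 0}

@with_retries(retries=3, retry_delay_seconds=0)
def flaky_extract():
    # Hypothetical task that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = flaky_extract()
```

Retries only help when the wrapped step is safe to repeat, so each retried step should also be idempotent.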
Idempotency
All orchestrators assume idempotent operations: running a task twice with the same inputs should leave the target in the same state. Design your INSERT, UPDATE, and MERGE operations to be idempotent.
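One common way to get an idempotent write is an upsert keyed on the natural key of the data. The sketch below uses the standard-library `sqlite3` module as a stand-in for a real warehouse; the `daily_totals` table, its columns, and the sample rows are hypothetical, and the `ON CONFLICT` clause assumes SQLite 3.24 or newer.

```python
import sqlite3

def load_daily_totals(conn, rows):
    """Upsert keyed on (day, region) so re-running the load cannot duplicate rows."""
    conn.executemany(
        """
        INSERT INTO daily_totals (day, region, total)
        VALUES (?, ?, ?)
        ON CONFLICT (day, region) DO UPDATE SET total = excluded.total
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_totals (day TEXT, region TEXT, total REAL, PRIMARY KEY (day, region))"
)
batch = [("2024-01-01", "eu", 120.0), ("2024-01-01", "us", 80.0)]

load_daily_totals(conn, batch)
load_daily_totals(conn, batch)  # simulated retry of the same batch

row_count = conn.execute("SELECT COUNT(*) FROM daily_totals").fetchone()[0]
```

A plain `INSERT` would leave four rows after the simulated retry; the keyed upsert leaves two.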
State Management
- •Prefect: Flow run state persisted to database/cloud
- •Dagster: Asset materialization events tracked
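- •dbt: Model run status written to the run_results.json artifact (under target/ by default); models are rebuilt from their SELECT statements according to each model's materialization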
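As an illustration of using persisted run state, the sketch below scans a dbt run-results artifact for failed nodes. The sample is a trimmed, hypothetical payload, and the field names (`results`, `unique_id`, `status`) and status values are assumptions about the artifact layout that should be checked against the dbt version in use. `failed_nodes` is a hypothetical helper.

```python
import json

# Trimmed, hypothetical sample of a dbt run-results artifact.
SAMPLE = """
{
  "results": [
    {"unique_id": "model.shop.stg_orders", "status": "success"},
    {"unique_id": "model.shop.fct_orders", "status": "error"},
    {"unique_id": "test.shop.not_null_orders_id", "status": "fail"}
  ]
}
"""

def failed_nodes(artifact: dict) -> list[str]:
    """Return the unique_id of every node whose status signals a failure."""
    failing = {"error", "fail"}  # assumed failure statuses
    return [
        result["unique_id"]
        for result in artifact.get("results", [])
        if result.get("status") in failing
    ]

failures = failed_nodes(json.loads(SAMPLE))
```

A check like this could feed an alert or decide whether downstream steps should run.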
Dependency Management
- •Prefect: Dependencies inferred from data passed between tasks; explicit ordering via wait_for=[...]
- •Dagster: Asset dependencies (@asset(deps=[other_asset]), or an upstream asset taken as a function argument)
- •dbt: DAG built from ref() calls in models
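How a DAG can be derived from `ref()` calls is sketched below with a regular expression and the standard-library `graphlib`. This is a simplified illustration, not dbt's parser (which renders Jinja rather than pattern-matching); the model names and SQL bodies are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical model files: name -> SQL body.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}

# Matches {{ ref('name') }} or {{ ref("name") }} and captures the model name.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")

def build_order(models: dict[str, str]) -> list[str]:
    """Return model names ordered so every model follows the models it refs."""
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())

order = build_order(MODELS)
```

Both staging models come before `fct_orders` in `order`; their order relative to each other is not significant.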
Production Recommendations
- •Version control everything: Code, configs, dbt models, Prefect/Dagster definitions
- •Test locally first: Use unit tests for transformation logic, integration tests for pipeline runs
- •Use environment variables for credentials (never hardcode)
- •Monitor pipeline runs: Prefect Cloud UI, Dagster UI (formerly Dagit), dbt Cloud or custom alerts
- •Alert on failures: Configure email/Slack/webhook notifications
- •Log aggregation: Send orchestrator logs to centralized system (Datadog, CloudWatch)
- •Idempotent writes: Avoid duplicate data on retries
- •Schema evolution: Handle schema changes gracefully (additive only preferred)
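The environment-variable recommendation above can be sketched as a small helper that fails fast when a credential is missing, so a misconfigured deployment stops before any data is touched. The variable names `WAREHOUSE_USER` and `WAREHOUSE_PASSWORD`, the helper, and the exception class are hypothetical.

```python
import os

class MissingCredentialError(RuntimeError):
    """Raised when a required credential is absent from the environment."""

def warehouse_credentials(env=os.environ) -> dict:
    """Read credentials from the environment and fail fast if any are absent."""
    required = ("WAREHOUSE_USER", "WAREHOUSE_PASSWORD")  # hypothetical variable names
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise MissingCredentialError(
            f"missing environment variables: {', '.join(missing)}"
        )
    return {"user": env["WAREHOUSE_USER"], "password": env["WAREHOUSE_PASSWORD"]}

# Passing a mapping explicitly keeps the example independent of the real environment.
creds = warehouse_credentials({"WAREHOUSE_USER": "etl", "WAREHOUSE_PASSWORD": "s3cret"})
```

The error message lists only the names of the missing variables, never their values, so it is safe to send to aggregated logs.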