Data Engineer

Name: data-engineer
Rating: 88
Author: robertlupo1997

Workflow

•Assess - Identify sources, volumes, velocity, SLAs, consumers
•Design - Choose architecture pattern, storage layer, processing engine
•Implement - Build pipelines with proper error handling and idempotency
•Quality - Add validation, completeness checks, anomaly detection
•Monitor - Set up metrics, alerts, lineage tracking
•Optimize - Tune for cost and performance iteratively

Architecture Selection

Pattern	Use When
Medallion (bronze/silver/gold)	Multi-stage refinement, Databricks/Delta Lake
Lambda	Need both batch accuracy and real-time speed
Kappa	Stream-first, reprocessing via replay
Data Mesh	Domain-oriented, decentralized ownership

Pipeline Patterns

•Idempotency: Use merge/upsert on a stable key rather than blind append, so reruns do not duplicate rows. Track watermarks to load only new or changed data.
•Checkpointing: Enable recovery without full reprocessing
•Schema evolution: Use formats that support it (Parquet, Avro, Iceberg)
•Partitioning: By date/region for pruning; avoid over-partitioning small data
•File sizing: Target 128 MB to 1 GB files; compact small files
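
The idempotency and watermark points above can be sketched as follows. This is a minimal illustration, not a production loader: it uses SQLite from the Python standard library as a stand-in for a warehouse, and the `orders` table, the `watermark` table, and the `load_batch` function are hypothetical names chosen for the example. It assumes timestamps are ISO-8601 strings in one consistent format (so string comparison matches time order) and a SQLite build that supports `ON CONFLICT ... DO UPDATE` (3.24 or later).

```python
import sqlite3


def load_batch(conn, rows):
    """Upsert rows newer than the stored watermark; safe to rerun.

    `orders` and `watermark` are hypothetical tables for this sketch.
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS watermark "
        "(pipeline TEXT PRIMARY KEY, value TEXT)"
    )

    # Read the high-water mark left by the previous successful run.
    row = cur.execute(
        "SELECT value FROM watermark WHERE pipeline = 'orders'"
    ).fetchone()
    watermark = row[0] if row else ""

    # Keep only rows strictly newer than the watermark.
    fresh = [r for r in rows if r["updated_at"] > watermark]

    for r in fresh:
        # Upsert keyed on id: a replayed row overwrites instead of duplicating.
        cur.execute(
            "INSERT INTO orders (id, amount, updated_at) VALUES (?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET "
            "amount = excluded.amount, updated_at = excluded.updated_at",
            (r["id"], r["amount"], r["updated_at"]),
        )

    if fresh:
        new_watermark = max(r["updated_at"] for r in fresh)
        cur.execute(
            "INSERT INTO watermark (pipeline, value) VALUES ('orders', ?) "
            "ON CONFLICT(pipeline) DO UPDATE SET value = excluded.value",
            (new_watermark,),
        )

    # Data and watermark are committed together, so a crash leaves both unchanged.
    conn.commit()
    return len(fresh)
```

Running the same batch twice loads it once: the second call finds nothing newer than the watermark. A strict `>` comparison can skip late rows that share the watermark timestamp; switching to `>=` trades that risk for some re-processing, which the upsert keeps harmless.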

Quality Framework

Completeness  → Row counts, null checks, required fields
Consistency   → Cross-source reconciliation, referential integrity
Accuracy      → Business rule validation, range checks
Timeliness    → Freshness SLAs, pipeline latency tracking
Uniqueness    → Duplicate detection, key constraints
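
One way the five dimensions above might translate into checks is sketched below, using plain Python over a list of dicts. The function name `run_quality_checks`, its parameters, and the `loaded_at` field are assumptions made for illustration; a real pipeline would more likely run equivalent checks in SQL or a dedicated validation tool. Consistency is reduced here to a row-count reconciliation against a source count.

```python
from datetime import datetime, timedelta, timezone


def run_quality_checks(rows, required, key, ranges, max_age, source_count, now=None):
    """Return a pass/fail flag per quality dimension (illustrative sketch).

    rows: list of dicts, each with a timezone-aware `loaded_at` datetime
    required: field names that must be present and non-null
    key: field expected to be unique
    ranges: {field: (low, high)} inclusive bounds for business rules
    max_age: timedelta freshness SLA
    source_count: row count reported by the upstream source
    """
    now = now or datetime.now(timezone.utc)
    keys = [r.get(key) for r in rows]
    newest = max((r["loaded_at"] for r in rows), default=None)

    return {
        # Completeness: at least one row, and no nulls in required fields.
        "completeness": bool(rows)
        and all(r.get(f) is not None for r in rows for f in required),
        # Consistency: loaded row count reconciles with the source.
        "consistency": len(rows) == source_count,
        # Accuracy: non-null values fall within their allowed ranges.
        "accuracy": all(
            lo <= r[f] <= hi
            for r in rows
            for f, (lo, hi) in ranges.items()
            if r.get(f) is not None
        ),
        # Timeliness: the newest row is within the freshness SLA.
        "timeliness": newest is not None and now - newest <= max_age,
        # Uniqueness: no duplicate keys.
        "uniqueness": len(keys) == len(set(keys)),
    }
```

Returning one flag per dimension, rather than raising on the first failure, lets a pipeline report every failed dimension in a single run.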

Cost Optimization

•Storage tiering: Hot → warm → cold based on access patterns
•Compute: Spot/preemptible for batch, reserved for steady-state
•Compression: Snappy for speed, Zstd for ratio
•Partition pruning: Filter pushdown to skip irrelevant data
•Materialized views: Pre-compute expensive aggregations
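
To illustrate partition pruning, the sketch below selects which files to read by inspecting Hive-style `key=value` path segments before any data is opened. Query engines do this internally; the `prune_partitions` function, the `dt` and `region` partition keys, and the example paths are hypothetical and serve only to show the idea.

```python
def prune_partitions(paths, start, end, regions=None):
    """Keep only paths whose dt partition is in [start, end] (inclusive).

    Assumes Hive-style segments such as dt=2024-01-01/region=eu and
    ISO-formatted dates, so string comparison matches date order.
    """
    selected = []
    for path in paths:
        # Parse key=value segments; segments without "=" are ignored.
        parts = dict(
            seg.split("=", 1) for seg in path.strip("/").split("/") if "=" in seg
        )
        if not (start <= parts.get("dt", "") <= end):
            continue  # outside the requested date range: never read
        if regions is not None and parts.get("region") not in regions:
            continue  # region filter pushed down to the path level
        selected.append(path)
    return selected


paths = [
    "events/dt=2024-01-01/region=eu/part-0.parquet",
    "events/dt=2024-01-02/region=us/part-0.parquet",
    "events/dt=2024-02-01/region=eu/part-0.parquet",
]
january_eu = prune_partitions(paths, "2024-01-01", "2024-01-31", regions={"eu"})
```

Here only the first path survives the filter, so two of the three files are skipped without being opened. The saving grows with the number of partitions a query can exclude.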

Orchestration Best Practices

•Separate DAGs by domain/SLA criticality
•Use sensors sparingly (prefer event-driven triggers)
•Implement circuit breakers for external dependencies
•Tag tasks for cost attribution
•Keep task duration under 1 hour for debuggability
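
A circuit breaker for an external dependency could look roughly like the sketch below. It is orchestrator-agnostic and not tied to any scheduler's API; the class name `CircuitBreaker`, its parameters, and `CircuitOpenError` are assumptions made for this example. After `max_failures` consecutive errors the breaker opens and rejects calls until `reset_after` seconds pass, then allows a single trial call.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering a dependency that is down.
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open state, let one trial call through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A task would wrap its external call as `breaker.call(fetch, url)`, where `fetch` and `url` are placeholders, and treat `CircuitOpenError` as a signal to skip or defer rather than retry immediately.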