Data Engineering Best Practices
Use this skill for production architecture and standards decisions: storage layout, lifecycle, incremental semantics, schema evolution, quality checks, and cost/performance tradeoffs.
When to use this skill
Use for:
- •Designing Bronze/Silver/Gold or equivalent data layers
- •Choosing append vs overwrite vs merge behavior
- •Partitioning and file-size strategy
- •Defining schema evolution policy
- •Setting testing/observability guardrails
- •Establishing retention + cost controls
Use domain skills for implementation details:
- •
@data-engineering-core - •
@data-engineering-storage-lakehouse - •
@data-engineering-storage-formats - •
@data-engineering-storage-remote-access - •
@data-engineering-storage-authentication
Decision checklist (apply in order)
- •Data contract
- •Required columns/types?
- •Nullability and key uniqueness?
- •Layer semantics
- •Bronze immutable?
- •Silver deduplicated/validated?
- •Gold business-ready aggregates?
- •Write mode
- •Append, partition overwrite, or merge?
- •Layout
- •Partition keys + target file size set?
- •Incremental logic
- •Watermark/checkpoint strategy defined?
- •Evolution policy
- •Additive-only by default?
- •Operational controls
- •Tests + observability + retention + backfill process?
Core standards
1) Layering (Medallion)
- •Bronze: raw immutable ingestion; append-only
- •Silver: cleaned, validated, conformed schema
- •Gold: consumption-specific marts/features/aggregates
Do not skip Silver validation for convenience; silent quality drift is costly.
2) Write semantics
| Operation | Use when | Notes |
|---|---|---|
| Append | strictly new immutable events | simplest, cheapest |
| Partition overwrite | deterministic reprocessing for date/key slice | safe for backfills |
| Merge/Upsert | corrections/late updates/deletes | needs key + conflict semantics |
3) Partitioning
Good partition keys:
- •Frequently filtered dimensions (often date + low/moderate-cardinality dimension)
Avoid:
- •High-cardinality keys (e.g., user_id)
- •Over-partitioning creating tiny files
4) File sizing
Target file size: ~256MB–1GB (workload-dependent).
- •Too small → metadata/listing overhead + slow scans
- •Too large → poor parallelism and skewed processing
5) Schema evolution
Default policy:
- •✅ additive changes first (new nullable columns)
- •⚠️ type widening only when compatibility is clear
- •❌ destructive rename/drop in-place for shared production tables
6) Incremental processing
- •Persist watermark/checkpoint externally
- •Make re-runs idempotent
- •Include late-arriving data strategy (lag window/backfill)
7) Quality and reliability
Minimum controls:
- •Required columns + types
- •Primary key uniqueness (or dedupe policy)
- •Null thresholds on critical fields
- •Freshness/SLA checks
- •Run-level metrics (rows in/out, failures, latency)
Anti-patterns (reject in review)
- •Full table overwrite for small incremental changes
- •No checkpoint/watermark for recurring pipeline
- •Unbounded tiny-file generation
- •Dynamic SQL built from user values without parameter binding
- •Production credentials in code/config committed to repo
- •No backfill plan / no rollback strategy
Minimal production blueprint
- •Ingest raw to Bronze (append-only)
- •Validate + standardize to Silver
- •Build Gold outputs
- •Emit metrics + quality report
- •Persist checkpoint/watermark
- •Apply lifecycle rules + periodic compaction/maintenance
Progressive disclosure (read next as needed)
- •
best-practices-detailed.md— comprehensive deep-dive examples - •
@data-engineering-core/patterns/incremental.md— incremental loading patterns - •
@data-engineering-storage-lakehouse— Delta/Iceberg/Hudi-specific behavior - •
@data-engineering-quality— validation framework implementation - •
@data-engineering-observability— metrics/tracing/alerting