Data Architecture
Purpose
Guide architects and platform engineers through strategic data architecture decisions for modern cloud-native data platforms.
When to Use This Skill
Invoke this skill when:
- •Designing a new data platform or modernizing legacy systems
- •Choosing between data lake, data warehouse, or data lakehouse
- •Deciding on data modeling approaches (dimensional, normalized, data vault, wide tables)
- •Evaluating centralized vs data mesh architecture
- •Selecting open table formats (Apache Iceberg, Delta Lake, Apache Hudi)
- •Designing medallion architecture (bronze, silver, gold layers)
- •Implementing data governance and cataloging
Core Concepts
1. Storage Paradigms
Three primary patterns for analytical data storage:
Data Lake: Centralized repository for raw data at scale
- •Schema-on-read, cost-optimized ($0.02-0.03/GB/month)
- •Use when: Diverse data sources, exploratory analytics, ML/AI training data
Data Warehouse: Structured repository optimized for BI
- •Schema-on-write, ACID transactions, fast queries
- •Use when: Known BI requirements, strong governance needed
Data Lakehouse: Hybrid combining lake flexibility with warehouse reliability
- •Open table formats (Iceberg, Delta Lake), ACID on object storage
- •Use when: Mixed BI + ML workloads, cost optimization (60-80% cheaper than warehouse)
Decision Framework:
- •BI/Reporting only + Known queries → Data Warehouse
- •ML/AI primary + Raw data needed → Data Lake or Lakehouse
- •Mixed BI + ML + Cost optimization → Data Lakehouse (recommended)
- •Exploratory/Unknown use cases → Data Lake
For detailed comparison, see references/storage-paradigms.md.
2. Data Modeling Approaches
Four primary modeling patterns:
Dimensional (Kimball): Star/snowflake schemas for BI
- •Use when: Known query patterns, BI dashboards, trend analysis
Normalized (3NF): Eliminate redundancy for transactional systems
- •Use when: OLTP systems, frequent updates, strong consistency
Data Vault 2.0: Flexible model with complete audit trail
- •Use when: Compliance requirements, multiple sources, agile warehousing
Wide Tables: Denormalized, optimized for columnar storage
- •Use when: ML feature stores, data science notebooks, high-performance dashboards
Decision Framework:
- •Analytical (BI) + Known queries → Dimensional (Star Schema)
- •Transactional (OLTP) → Normalized (3NF)
- •Compliance/Audit → Data Vault 2.0
- •Data Science/ML → Wide Tables
For detailed patterns, see references/modeling-approaches.md.
3. Data Mesh Principles
Decentralized architecture for large organizations (>500 people).
Four Core Principles:
- •Domain-oriented decentralization
- •Data as a product (SLAs, quality, documentation)
- •Self-serve data infrastructure
- •Federated computational governance
Readiness Assessment (Score 1-5 each):
- •Domain clarity
- •Team maturity
- •Platform capability
- •Governance maturity
- •Scale need
- •Organizational buy-in
Scoring: 24-30: Strong candidate | 18-23: Hybrid | 12-17: Build foundation first | 6-11: Centralized
Red Flags: Small org (<100 people), unclear domains, no platform team, weak governance
For full guide, see references/data-mesh-guide.md.
4. Medallion Architecture
Standard lakehouse pattern: Bronze (raw) → Silver (cleaned) → Gold (business-level)
Bronze Layer: Exact copy of source data, immutable, append-only
Silver Layer: Validated, deduplicated, typed data
Gold Layer: Business logic, aggregates, dimensional models, ML features
Data Quality by Layer:
- •Bronze → Silver: Schema validation, type checks, deduplication
- •Silver → Gold: Business rule validation, referential integrity
- •Gold: Anomaly detection, statistical checks
For patterns, see references/medallion-pattern.md.
5. Open Table Formats
Enable ACID transactions on data lakes:
Apache Iceberg: Multi-engine, vendor-neutral (Context7: 79.7 score)
- •Use when: Avoid vendor lock-in, multi-engine flexibility
Delta Lake: Databricks ecosystem, Spark-optimized
- •Use when: Committed to Databricks
Apache Hudi: Optimized for CDC and frequent upserts
- •Use when: CDC-heavy workloads
Recommendation: Apache Iceberg for new projects (vendor-neutral, broadest support)
For comparison, see references/table-formats.md.
6. Modern Data Stack
Standard Layers:
- •Ingestion: Fivetran, Airbyte, Kafka
- •Storage: Snowflake, Databricks, BigQuery
- •Transformation: dbt (Context7: 87.0 score), Spark
- •Orchestration: Airflow, Dagster, Prefect
- •Visualization: Tableau, Looker, Power BI
- •Governance: DataHub, Alation, Great Expectations
Tool Selection:
- •Fivetran vs Airbyte: Pre-built connectors vs cost-sensitive
- •Snowflake vs Databricks: BI-focused vs ML-focused
- •dbt vs Spark: SQL-based vs large-scale processing
For detailed recommendations, see references/tool-recommendations.md and references/modern-data-stack.md.
7. Data Governance
Data Catalog: Searchable inventory (DataHub, Alation, Collibra)
Data Lineage: Track data flow (OpenLineage, Marquez)
Data Quality: Validation and testing (Great Expectations, Soda, dbt tests)
Access Control:
- •RBAC: Role-based (sales_analyst role)
- •ABAC: Attribute-based (row-level security)
- •Column-level: Dynamic data masking for PII
For governance patterns, see references/governance-patterns.md.
Decision Frameworks
Framework 1: Storage Paradigm Selection
Step 1: Identify Primary Use Case
- •BI/Reporting only → Data Warehouse
- •ML/AI primary → Data Lake or Lakehouse
- •Mixed BI + ML → Data Lakehouse
- •Exploratory → Data Lake
Step 2: Evaluate Budget
- •High budget, known queries → Data Warehouse
- •Cost-sensitive, flexible → Data Lakehouse
Recommendation by Org Size:
- •Startup (<50): Data Warehouse (simplicity)
- •Growth (50-500): Data Lakehouse (balance)
- •Enterprise (>500): Hybrid or unified Lakehouse
See references/decision-frameworks.md.
Framework 2: Data Modeling Approach
Decision Tree:
- •Analytical (BI) workload → Dimensional or Wide Tables
- •Transactional (OLTP) → Normalized (3NF)
- •Compliance/Audit → Data Vault 2.0
- •Data Science/ML → Wide Tables
See references/decision-frameworks.md.
Framework 3: Data Mesh Readiness
Use 6-factor assessment. Score interpretation:
- •24-30: Proceed with data mesh
- •18-23: Hybrid approach
- •12-17: Build foundation first
- •6-11: Centralized
See references/decision-frameworks.md.
Framework 4: Open Table Format Selection
Decision Tree:
- •Multi-engine flexibility → Apache Iceberg
- •Databricks ecosystem → Delta Lake
- •Frequent upserts/CDC → Apache Hudi
Recommendation: Apache Iceberg for new projects
See references/decision-frameworks.md.
Common Scenarios
Startup Data Platform
Context: 50-person startup, PostgreSQL + MongoDB + Stripe
Recommendation:
- •Storage: BigQuery or Snowflake
- •Ingestion: Airbyte or Fivetran
- •Transformation: dbt
- •Orchestration: dbt Cloud
- •Architecture: Simple data warehouse
Enterprise Modernization
Context: Legacy Oracle warehouse, need cloud migration
Recommendation:
- •Storage: Data Lakehouse (Databricks or Snowflake with Iceberg)
- •Strategy: Incremental migration with CDC
- •Architecture: Medallion (bronze, silver, gold)
- •Cost Savings: 60-80%
Data Mesh Assessment
Context: 200-person company, 5-person central data team
Recommendation: NOT YET. Build foundation first.
- •Organization too small (<500 recommended)
- •Central team not yet bottleneck
- •Invest in self-serve platform and governance
Tool Recommendations
Research-Validated (Context7, December 2025)
dbt: Score 87.0, 3,532+ code snippets
- •SQL-based transformations, version control, testing
- •Industry standard for data transformation
Apache Iceberg: Score 79.7, 832+ code snippets
- •Open table format, multi-engine, vendor-neutral
- •Production-ready (Netflix, Apple, Adobe)
Tool Stack by Use Case:
Startup: BigQuery + Airbyte + dbt + Metabase (<$1K/month)
Growth: Snowflake + Fivetran + dbt + Airflow + Tableau ($10K-50K/month)
Enterprise: Snowflake + Databricks + Fivetran + Kafka + dbt + Airflow + Alation ($50K-500K/month)
See references/tool-recommendations.md.
Implementation Patterns
Pattern 1: Medallion Architecture
-- Bronze: Raw ingestion CREATE TABLE bronze.raw_customers (_ingested_at TIMESTAMP, _raw_data STRING); -- Silver: Cleaned CREATE TABLE silver.customers AS SELECT json_extract(_raw_data, '$.id') AS customer_id, ... FROM bronze.raw_customers QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY _ingested_at DESC) = 1; -- Gold: Business-level CREATE TABLE gold.fact_sales AS SELECT s.order_id, d.date_key, c.customer_key, ... FROM silver.sales s JOIN gold.dim_date d ON s.order_date = d.date;
Pattern 2: Apache Iceberg Table
CREATE TABLE catalog.db.sales (order_id BIGINT, amount DECIMAL(10,2)) USING iceberg PARTITIONED BY (days(order_date)); -- Time travel SELECT * FROM catalog.db.sales TIMESTAMP AS OF '2025-01-01';
Pattern 3: dbt Transformation
-- models/staging/stg_customers.sql
WITH source AS (SELECT * FROM {{ source('raw', 'customers') }}),
cleaned AS (
SELECT customer_id, UPPER(customer_name) AS customer_name
FROM source WHERE customer_id IS NOT NULL
)
SELECT * FROM cleaned
For complete examples, see examples/.
Best Practices
- •Start simple: Avoid over-engineering; begin with warehouse or basic lakehouse
- •Invest in governance early: Catalog, lineage, quality from day one
- •Medallion architecture: Use bronze-silver-gold for clear quality layers
- •Open table formats: Prefer Iceberg or Delta Lake to avoid vendor lock-in
- •Assess mesh readiness: Don't decentralize prematurely (<500 people)
- •Automate quality: Integrate tests (Great Expectations, dbt) into CI/CD
- •Monitor pipelines: Observability is critical (freshness, quality, health)
- •Document as code: Use dbt docs, DataHub, YAML for self-service
- •Incremental loading: Only load new/changed data (watermark columns)
- •Business alignment: Align architecture to outcomes, not just technologies
Anti-Patterns
- •❌ Data swamp: Lake without governance or cataloging
- •❌ Premature mesh: Mesh before organizational readiness
- •❌ Tool sprawl: Too many tools without integration
- •❌ No quality checks: "Garbage in, garbage out"
- •❌ Centralized bottleneck: Single team in large org (>500 people)
- •❌ Vendor lock-in: Proprietary formats without migration path
- •❌ No lineage: Can't answer "where did this come from?"
- •❌ Over-engineering: Complex architecture for simple use cases
Integration with Other Skills
Direct Dependencies:
- •ingesting-data: ETL/ELT mechanics, Fivetran, Airbyte implementation
- •data-transformation: dbt and Dataform detailed implementation
- •streaming-data: Kafka, Flink for real-time pipelines
Complementary:
- •databases-relational: PostgreSQL, MySQL as source systems
- •databases-document: MongoDB, DynamoDB as sources
- •ai-data-engineering: Feature stores, ML training pipelines
- •designing-distributed-systems: CAP theorem, consistency models
- •observability: Monitoring pipeline health, data quality metrics
Downstream:
- •visualizing-data: BI and dashboard patterns
- •sql-optimization: Query performance tuning
Common Workflows:
End-to-End Analytics:
data-architecture (warehouse) → ingesting-data (Fivetran) → data-transformation (dbt) → visualizing-data (Tableau)
Data Platform for AI/ML:
data-architecture (lakehouse) → ingesting-data (Kafka) → data-transformation (dbt features) → ai-data-engineering (feature store)
Further Reading
Reference Files:
- •decision-frameworks.md - All 4 decision frameworks in detail
- •storage-paradigms.md - Lake vs warehouse vs lakehouse
- •modeling-approaches.md - Dimensional, normalized, data vault, wide
- •data-mesh-guide.md - Data mesh principles and implementation
- •medallion-pattern.md - Bronze, silver, gold layers
- •table-formats.md - Iceberg, Delta Lake, Hudi comparison
- •tool-recommendations.md - Tool analysis and recommendations
- •modern-data-stack.md - Tool categories and selection
- •governance-patterns.md - Catalog, lineage, quality, access control
- •scenarios.md - Startup, enterprise, data mesh scenarios
Examples:
- •examples/dbt-project/ - dbt project with medallion architecture
External Resources:
- •Apache Iceberg: https://iceberg.apache.org/
- •dbt Documentation: https://docs.getdbt.com/
- •Data Mesh (Zhamak Dehghani): https://www.datamesh-architecture.com/
- •Databricks Medallion: https://www.databricks.com/glossary/medallion-architecture