data-warehouse
Use this skill for 大数据/数仓(DW)建设与运维:从数据接入到指标交付与治理。
Defaults / assumptions to confirm
- •Warehouse tech: BigQuery/Snowflake/Redshift/Hive/ClickHouse/etc.
- •Orchestration: Airflow/Dagster/Argo/dbt
- •Ingestion: CDC vs batch, streaming (Kafka) vs file
- •Data consumers: BI dashboards, product analytics, ML features
Core outputs
- •Warehouse architecture (sources → staging → warehouse → marts)
- •Data model (facts/dimensions) + metric definitions
- •Pipeline plan (DAGs, schedules, dependencies, SLAs)
- •Data quality plan (checks, thresholds, alerts)
- •Cost/performance plan (partitioning, clustering, materialization)
- •Governance plan (access control, PII handling, retention)
Workflow
- •Understand the business questions
- •What decisions will this warehouse support?
- •Define critical metrics and their definitions (single source of truth).
- •Source & ingestion design
- •Identify systems of record and ownership.
- •Choose ingestion: CDC for mutable OLTP, append-only logs for events.
- •Define late-arriving data strategy and backfills.
- •Modeling (practical)
- •Prefer star schema for BI: Fact tables + Dimension tables.
- •Keep grain explicit (one row represents what?).
- •Separate raw/staging from curated models; avoid mixing.
- •ETL/ELT pipelines
- •Define DAGs with clear inputs/outputs and idempotency.
- •Handle retries, partial failures, and reprocessing windows.
- •Provide backfill procedures and runbook.
- •Data quality & observability
- •Validate freshness, volume, schema drift, null ratios, referential consistency (logical).
- •Add anomaly detection for key metrics.
- •Track pipeline success rate, duration, and SLA misses.
- •Performance & cost
- •Partition by date/time for large facts; cluster by common filters/joins.
- •Materialize expensive queries (summary tables, incremental models).
- •Control scan cost with column pruning and predicate pushdown.
- •Governance & security
- •PII classification, masking, and retention.
- •RBAC/ABAC for datasets; audit logging.
- •Document lineage and ownership for each table/model.
Templates
Table spec
- •Name:
- •Grain:
- •Partition/cluster keys:
- •Key columns:
- •Sources:
- •Refresh cadence:
- •Consumers:
- •Quality checks:
- •Owner:
Metric spec
- •Name:
- •Definition:
- •Numerator/denominator:
- •Filters:
- •Time window:
- •Known caveats: