AgentSkillsCN

ELT Modeling

ELT(提取、加载、转换)建模模式、维度建模、事实表与维度表设计,以及数据仓库架构综合指南

SKILL.md
--- frontmatter
name: ELT Modeling
description: Comprehensive guide to ELT (Extract, Load, Transform) modeling patterns, dimensional modeling, fact and dimension tables, and data warehouse design

ELT Modeling

ELT vs ETL

ETL (Traditional)

code
Extract → Transform → Load

1. Extract from source
2. Transform in ETL tool (Informatica, Talend)
3. Load to warehouse

Pros: Clean data before loading
Cons: Slow, expensive transformation servers

ELT (Modern)

code
Extract → Load → Transform

1. Extract from source
2. Load raw data to warehouse
3. Transform in warehouse (SQL, dbt)

Pros: Fast loading, leverage warehouse power
Cons: Raw data in warehouse (need governance)

Why ELT?

  • Cloud warehouses: Snowflake, BigQuery, Redshift (powerful, scalable)
  • Separation of concerns: EL (Fivetran, Airbyte) + T (dbt)
  • Flexibility: Transform multiple ways from same raw data
  • Speed: Parallel processing in warehouse

Dimensional Modeling

Star Schema

code
        dim_customers
              |
        dim_products
              |
         fct_orders  ← Central fact table
              |
        dim_dates
              |
        dim_locations

Characteristics:

  • One fact table (center)
  • Multiple dimension tables (points)
  • Denormalized (fast queries)

Snowflake Schema

code
    dim_customers
          |
    dim_customer_segments
          |
     fct_orders
          |
    dim_products
          |
    dim_product_categories

Characteristics:

  • Normalized dimensions
  • Less redundancy
  • More joins (slower queries)

Recommendation: Use star schema for analytics (faster)


Fact Tables

Definition

Fact table: Stores measurable events (transactions, orders, clicks)

Characteristics

  • Large: Millions to billions of rows
  • Numeric measures: Amounts, quantities, counts
  • Foreign keys: Links to dimensions
  • Grain: Level of detail (one row per order, per day, etc.)

Types of Facts

Transaction Facts:

sql
-- One row per transaction
fct_orders:
  order_id (PK)
  customer_id (FK)
  product_id (FK)
  order_date_id (FK)
  quantity
  amount

Periodic Snapshot Facts:

sql
-- One row per period (daily, monthly)
fct_inventory_daily:
  date_id (PK)
  product_id (PK)
  warehouse_id (PK)
  quantity_on_hand
  quantity_sold

Accumulating Snapshot Facts:

sql
-- One row per process (order lifecycle)
fct_order_lifecycle:
  order_id (PK)
  order_date
  payment_date
  shipment_date
  delivery_date
  days_to_ship
  days_to_deliver

Fact Table Design

sql
CREATE TABLE fct_orders (
  -- Surrogate key
  order_key BIGINT PRIMARY KEY,
  
  -- Natural key
  order_id VARCHAR(50) NOT NULL,
  
  -- Foreign keys (dimensions)
  customer_key BIGINT NOT NULL,
  product_key BIGINT NOT NULL,
  date_key INT NOT NULL,
  
  -- Degenerate dimensions (no separate dim table)
  order_number VARCHAR(50),
  
  -- Measures
  quantity INT,
  unit_price DECIMAL(10,2),
  total_amount DECIMAL(10,2),
  discount_amount DECIMAL(10,2),
  
  -- Audit columns
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Dimension Tables

Definition

Dimension table: Stores descriptive attributes (who, what, where, when)

Characteristics

  • Small: Thousands to millions of rows
  • Descriptive: Text, categories, hierarchies
  • Slowly changing: Updates over time (SCD)

Dimension Design

sql
CREATE TABLE dim_customers (
  -- Surrogate key
  customer_key BIGINT PRIMARY KEY,
  
  -- Natural key
  customer_id VARCHAR(50) NOT NULL,
  
  -- Attributes
  customer_name VARCHAR(255),
  email VARCHAR(255),
  phone VARCHAR(50),
  
  -- Hierarchies
  city VARCHAR(100),
  state VARCHAR(100),
  country VARCHAR(100),
  region VARCHAR(100),
  
  -- Segments
  customer_segment VARCHAR(50),
  customer_tier VARCHAR(50),
  
  -- SCD Type 2 columns
  effective_date DATE,
  expiration_date DATE,
  is_current BOOLEAN,
  
  -- Audit columns
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

Slowly Changing Dimensions (SCD)

Type 1: Overwrite

sql
-- Customer moves, update address
UPDATE dim_customers
SET city = 'New York', state = 'NY'
WHERE customer_id = '123';

-- No history preserved

Type 2: Add New Row (Historical)

sql
-- Customer moves, add new row
INSERT INTO dim_customers (
  customer_id, city, state,
  effective_date, is_current
) VALUES (
  '123', 'New York', 'NY',
  '2024-01-15', TRUE
);

-- Mark old row as not current
UPDATE dim_customers
SET is_current = FALSE,
    expiration_date = '2024-01-14'
WHERE customer_id = '123'
  AND is_current = TRUE;

-- History preserved

Type 3: Add New Column

sql
-- Track previous value
ALTER TABLE dim_customers
ADD COLUMN previous_city VARCHAR(100);

UPDATE dim_customers
SET previous_city = city,
    city = 'New York'
WHERE customer_id = '123';

Grain Definition

What is Grain?

Grain: Level of detail in fact table (one row represents what?)

Examples

code
Order grain:      One row per order
Order line grain: One row per product in order
Daily grain:      One row per day per product
Hourly grain:     One row per hour per product

Choosing Grain

code
Too fine (order line):
- More rows
- More storage
- More flexibility

Too coarse (monthly):
- Fewer rows
- Less storage
- Less flexibility

Balance: Choose finest grain needed for analysis

Kimball vs Inmon

Kimball (Bottom-Up, Dimensional)

code
Data Marts (Star Schemas) → Enterprise Data Warehouse

Pros:
- Fast to implement
- Business-friendly (denormalized)
- Query performance

Cons:
- Data redundancy
- Hard to change

Inmon (Top-Down, Normalized)

code
Enterprise Data Warehouse (3NF) → Data Marts

Pros:
- Single source of truth
- Flexible
- Less redundancy

Cons:
- Slow to implement
- Complex queries (many joins)

Modern Approach: Hybrid (normalized staging, dimensional marts)


Data Vault

Components

code
Hubs:       Business keys (customers, products)
Links:      Relationships (orders link customers + products)
Satellites: Descriptive attributes (customer details)

Example

sql
-- Hub: Customer
CREATE TABLE hub_customer (
  customer_hub_key BIGINT PRIMARY KEY,
  customer_id VARCHAR(50) UNIQUE,
  load_date TIMESTAMP,
  record_source VARCHAR(50)
);

-- Satellite: Customer Details
CREATE TABLE sat_customer_details (
  customer_hub_key BIGINT,
  load_date TIMESTAMP,
  customer_name VARCHAR(255),
  email VARCHAR(255),
  phone VARCHAR(50),
  PRIMARY KEY (customer_hub_key, load_date)
);

-- Link: Order
CREATE TABLE link_order (
  order_link_key BIGINT PRIMARY KEY,
  customer_hub_key BIGINT,
  product_hub_key BIGINT,
  load_date TIMESTAMP
);

Pros:

  • Audit trail
  • Flexible
  • Handles source changes

Cons:

  • Complex
  • Many joins
  • Steep learning curve

Conformed Dimensions

Definition

Conformed dimension: Shared across multiple fact tables

Example

sql
-- dim_date used by multiple facts
dim_date ← fct_orders
        ← fct_shipments
        ← fct_returns

-- Ensures consistent date attributes

Benefits

  • Consistency: Same date logic everywhere
  • Drill-across: Compare metrics across facts
  • Reusability: Build once, use many times

Bridge Tables

Many-to-Many Relationships

sql
-- Customer can have multiple accounts
-- Account can have multiple customers

CREATE TABLE bridge_customer_account (
  customer_key BIGINT,
  account_key BIGINT,
  allocation_percentage DECIMAL(5,2),
  PRIMARY KEY (customer_key, account_key)
);

Junk Dimensions

Definition

Junk dimension: Collection of low-cardinality flags

Example

sql
-- Instead of many boolean columns in fact
CREATE TABLE dim_order_flags (
  order_flag_key INT PRIMARY KEY,
  is_gift BOOLEAN,
  is_express_shipping BOOLEAN,
  is_first_order BOOLEAN,
  has_discount BOOLEAN
);

-- Fact table references junk dimension
fct_orders:
  order_key
  customer_key
  order_flag_key  ← Reference to junk dimension
  amount

Role-Playing Dimensions

Definition

Role-playing dimension: Same dimension used multiple times with different meanings

Example

sql
-- dim_date used multiple times
fct_orders:
  order_key
  order_date_key      ← dim_date (order date)
  ship_date_key       ← dim_date (ship date)
  delivery_date_key   ← dim_date (delivery date)
  amount

Surrogate Keys

Natural Key vs Surrogate Key

sql
-- Natural key: Business identifier
customer_id = 'CUST-12345'

-- Surrogate key: System-generated
customer_key = 1001 (auto-increment or hash)

Why Surrogate Keys?

  • Performance: Integer joins faster than string joins
  • SCD Type 2: Multiple rows for same natural key
  • Independence: Source system can change natural key
  • Consistency: Same format across all tables

Generating Surrogate Keys

sql
-- Auto-increment
customer_key BIGINT AUTO_INCREMENT PRIMARY KEY

-- Hash (dbt)
{{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }}

-- Sequence (PostgreSQL)
customer_key BIGINT DEFAULT nextval('customer_key_seq')

Data Warehouse Layers

Bronze (Raw)

code
Purpose: Exact copy of source data
Format: As-is from source
Schema: Source schema
Example: raw_salesforce_accounts

Silver (Cleaned)

code
Purpose: Cleaned, standardized
Format: Consistent types, naming
Schema: Staging schema
Example: stg_salesforce_accounts

Gold (Analytics)

code
Purpose: Business-ready
Format: Dimensional models
Schema: Analytics schema
Example: dim_customers, fct_orders

Best Practices

1. Define Grain Clearly

code
✓ "One row per order"
✗ "Order data"

2. Use Surrogate Keys

code
✓ customer_key BIGINT
✗ customer_id VARCHAR(50)

3. Denormalize Dimensions

code
✓ dim_customers includes city, state, country
✗ Separate dim_cities, dim_states, dim_countries

4. Keep Facts Narrow

code
✓ fct_orders: Keys + measures only
✗ fct_orders: Keys + measures + customer name, product name

5. Use Conformed Dimensions

code
✓ dim_date shared across all facts
✗ Each fact has own date dimension

Common Patterns

Daily Snapshot

sql
-- Capture state once per day
INSERT INTO fct_inventory_daily
SELECT
  CURRENT_DATE as snapshot_date,
  product_id,
  warehouse_id,
  quantity_on_hand,
  quantity_reserved
FROM current_inventory;

Cumulative Metrics

sql
-- Running totals
SELECT
  order_date,
  SUM(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) as cumulative_revenue
FROM fct_orders;

Summary

ELT: Extract → Load → Transform (modern approach)

Dimensional Modeling:

  • Star schema (recommended)
  • Fact tables (measures)
  • Dimension tables (attributes)

Fact Types:

  • Transaction (one per event)
  • Periodic snapshot (one per period)
  • Accumulating snapshot (one per process)

SCD Types:

  • Type 1: Overwrite
  • Type 2: Historical (recommended)
  • Type 3: Previous value

Grain: Level of detail (define clearly!)

Surrogate Keys: Use for performance and SCD

Layers:

  • Bronze: Raw
  • Silver: Cleaned
  • Gold: Analytics-ready

Best Practices:

  • Define grain
  • Use surrogate keys
  • Denormalize dimensions
  • Keep facts narrow
  • Use conformed dimensions