ELT Modeling
ELT vs ETL
ETL (Traditional)
Extract → Transform → Load 1. Extract from source 2. Transform in ETL tool (Informatica, Talend) 3. Load to warehouse Pros: Clean data before loading Cons: Slow, expensive transformation servers
ELT (Modern)
Extract → Load → Transform 1. Extract from source 2. Load raw data to warehouse 3. Transform in warehouse (SQL, dbt) Pros: Fast loading, leverage warehouse power Cons: Raw data in warehouse (need governance)
Why ELT?
- •Cloud warehouses: Snowflake, BigQuery, Redshift (powerful, scalable)
- •Separation of concerns: EL (Fivetran, Airbyte) + T (dbt)
- •Flexibility: Transform multiple ways from same raw data
- •Speed: Parallel processing in warehouse
Dimensional Modeling
Star Schema
dim_customers
|
dim_products
|
fct_orders ← Central fact table
|
dim_dates
|
dim_locations
Characteristics:
- •One fact table (center)
- •Multiple dimension tables (points)
- •Denormalized (fast queries)
Snowflake Schema
dim_customers
|
dim_customer_segments
|
fct_orders
|
dim_products
|
dim_product_categories
Characteristics:
- •Normalized dimensions
- •Less redundancy
- •More joins (slower queries)
Recommendation: Use star schema for analytics (faster)
Fact Tables
Definition
Fact table: Stores measurable events (transactions, orders, clicks)
Characteristics
- •Large: Millions to billions of rows
- •Numeric measures: Amounts, quantities, counts
- •Foreign keys: Links to dimensions
- •Grain: Level of detail (one row per order, per day, etc.)
Types of Facts
Transaction Facts:
-- One row per transaction fct_orders: order_id (PK) customer_id (FK) product_id (FK) order_date_id (FK) quantity amount
Periodic Snapshot Facts:
-- One row per period (daily, monthly) fct_inventory_daily: date_id (PK) product_id (PK) warehouse_id (PK) quantity_on_hand quantity_sold
Accumulating Snapshot Facts:
-- One row per process (order lifecycle) fct_order_lifecycle: order_id (PK) order_date payment_date shipment_date delivery_date days_to_ship days_to_deliver
Fact Table Design
CREATE TABLE fct_orders ( -- Surrogate key order_key BIGINT PRIMARY KEY, -- Natural key order_id VARCHAR(50) NOT NULL, -- Foreign keys (dimensions) customer_key BIGINT NOT NULL, product_key BIGINT NOT NULL, date_key INT NOT NULL, -- Degenerate dimensions (no separate dim table) order_number VARCHAR(50), -- Measures quantity INT, unit_price DECIMAL(10,2), total_amount DECIMAL(10,2), discount_amount DECIMAL(10,2), -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP );
Dimension Tables
Definition
Dimension table: Stores descriptive attributes (who, what, where, when)
Characteristics
- •Small: Thousands to millions of rows
- •Descriptive: Text, categories, hierarchies
- •Slowly changing: Updates over time (SCD)
Dimension Design
CREATE TABLE dim_customers ( -- Surrogate key customer_key BIGINT PRIMARY KEY, -- Natural key customer_id VARCHAR(50) NOT NULL, -- Attributes customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), -- Hierarchies city VARCHAR(100), state VARCHAR(100), country VARCHAR(100), region VARCHAR(100), -- Segments customer_segment VARCHAR(50), customer_tier VARCHAR(50), -- SCD Type 2 columns effective_date DATE, expiration_date DATE, is_current BOOLEAN, -- Audit columns created_at TIMESTAMP, updated_at TIMESTAMP );
Slowly Changing Dimensions (SCD)
Type 1: Overwrite
-- Customer moves, update address UPDATE dim_customers SET city = 'New York', state = 'NY' WHERE customer_id = '123'; -- No history preserved
Type 2: Add New Row (Historical)
-- Customer moves, add new row
INSERT INTO dim_customers (
customer_id, city, state,
effective_date, is_current
) VALUES (
'123', 'New York', 'NY',
'2024-01-15', TRUE
);
-- Mark old row as not current
UPDATE dim_customers
SET is_current = FALSE,
expiration_date = '2024-01-14'
WHERE customer_id = '123'
AND is_current = TRUE;
-- History preserved
Type 3: Add New Column
-- Track previous value
ALTER TABLE dim_customers
ADD COLUMN previous_city VARCHAR(100);
UPDATE dim_customers
SET previous_city = city,
city = 'New York'
WHERE customer_id = '123';
Grain Definition
What is Grain?
Grain: Level of detail in fact table (one row represents what?)
Examples
Order grain: One row per order Order line grain: One row per product in order Daily grain: One row per day per product Hourly grain: One row per hour per product
Choosing Grain
Too fine (order line): - More rows - More storage - More flexibility Too coarse (monthly): - Fewer rows - Less storage - Less flexibility Balance: Choose finest grain needed for analysis
Kimball vs Inmon
Kimball (Bottom-Up, Dimensional)
Data Marts (Star Schemas) → Enterprise Data Warehouse Pros: - Fast to implement - Business-friendly (denormalized) - Query performance Cons: - Data redundancy - Hard to change
Inmon (Top-Down, Normalized)
Enterprise Data Warehouse (3NF) → Data Marts Pros: - Single source of truth - Flexible - Less redundancy Cons: - Slow to implement - Complex queries (many joins)
Modern Approach: Hybrid (normalized staging, dimensional marts)
Data Vault
Components
Hubs: Business keys (customers, products) Links: Relationships (orders link customers + products) Satellites: Descriptive attributes (customer details)
Example
-- Hub: Customer CREATE TABLE hub_customer ( customer_hub_key BIGINT PRIMARY KEY, customer_id VARCHAR(50) UNIQUE, load_date TIMESTAMP, record_source VARCHAR(50) ); -- Satellite: Customer Details CREATE TABLE sat_customer_details ( customer_hub_key BIGINT, load_date TIMESTAMP, customer_name VARCHAR(255), email VARCHAR(255), phone VARCHAR(50), PRIMARY KEY (customer_hub_key, load_date) ); -- Link: Order CREATE TABLE link_order ( order_link_key BIGINT PRIMARY KEY, customer_hub_key BIGINT, product_hub_key BIGINT, load_date TIMESTAMP );
Pros:
- •Audit trail
- •Flexible
- •Handles source changes
Cons:
- •Complex
- •Many joins
- •Steep learning curve
Conformed Dimensions
Definition
Conformed dimension: Shared across multiple fact tables
Example
-- dim_date used by multiple facts
dim_date ← fct_orders
← fct_shipments
← fct_returns
-- Ensures consistent date attributes
Benefits
- •Consistency: Same date logic everywhere
- •Drill-across: Compare metrics across facts
- •Reusability: Build once, use many times
Bridge Tables
Many-to-Many Relationships
-- Customer can have multiple accounts -- Account can have multiple customers CREATE TABLE bridge_customer_account ( customer_key BIGINT, account_key BIGINT, allocation_percentage DECIMAL(5,2), PRIMARY KEY (customer_key, account_key) );
Junk Dimensions
Definition
Junk dimension: Collection of low-cardinality flags
Example
-- Instead of many boolean columns in fact CREATE TABLE dim_order_flags ( order_flag_key INT PRIMARY KEY, is_gift BOOLEAN, is_express_shipping BOOLEAN, is_first_order BOOLEAN, has_discount BOOLEAN ); -- Fact table references junk dimension fct_orders: order_key customer_key order_flag_key ← Reference to junk dimension amount
Role-Playing Dimensions
Definition
Role-playing dimension: Same dimension used multiple times with different meanings
Example
-- dim_date used multiple times fct_orders: order_key order_date_key ← dim_date (order date) ship_date_key ← dim_date (ship date) delivery_date_key ← dim_date (delivery date) amount
Surrogate Keys
Natural Key vs Surrogate Key
-- Natural key: Business identifier customer_id = 'CUST-12345' -- Surrogate key: System-generated customer_key = 1001 (auto-increment or hash)
Why Surrogate Keys?
- •Performance: Integer joins faster than string joins
- •SCD Type 2: Multiple rows for same natural key
- •Independence: Source system can change natural key
- •Consistency: Same format across all tables
Generating Surrogate Keys
-- Auto-increment
customer_key BIGINT AUTO_INCREMENT PRIMARY KEY
-- Hash (dbt)
{{ dbt_utils.surrogate_key(['customer_id', 'effective_date']) }}
-- Sequence (PostgreSQL)
customer_key BIGINT DEFAULT nextval('customer_key_seq')
Data Warehouse Layers
Bronze (Raw)
Purpose: Exact copy of source data Format: As-is from source Schema: Source schema Example: raw_salesforce_accounts
Silver (Cleaned)
Purpose: Cleaned, standardized Format: Consistent types, naming Schema: Staging schema Example: stg_salesforce_accounts
Gold (Analytics)
Purpose: Business-ready Format: Dimensional models Schema: Analytics schema Example: dim_customers, fct_orders
Best Practices
1. Define Grain Clearly
✓ "One row per order" ✗ "Order data"
2. Use Surrogate Keys
✓ customer_key BIGINT ✗ customer_id VARCHAR(50)
3. Denormalize Dimensions
✓ dim_customers includes city, state, country ✗ Separate dim_cities, dim_states, dim_countries
4. Keep Facts Narrow
✓ fct_orders: Keys + measures only ✗ fct_orders: Keys + measures + customer name, product name
5. Use Conformed Dimensions
✓ dim_date shared across all facts ✗ Each fact has own date dimension
Common Patterns
Daily Snapshot
-- Capture state once per day INSERT INTO fct_inventory_daily SELECT CURRENT_DATE as snapshot_date, product_id, warehouse_id, quantity_on_hand, quantity_reserved FROM current_inventory;
Cumulative Metrics
-- Running totals
SELECT
order_date,
SUM(amount) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as cumulative_revenue
FROM fct_orders;
Summary
ELT: Extract → Load → Transform (modern approach)
Dimensional Modeling:
- •Star schema (recommended)
- •Fact tables (measures)
- •Dimension tables (attributes)
Fact Types:
- •Transaction (one per event)
- •Periodic snapshot (one per period)
- •Accumulating snapshot (one per process)
SCD Types:
- •Type 1: Overwrite
- •Type 2: Historical (recommended)
- •Type 3: Previous value
Grain: Level of detail (define clearly!)
Surrogate Keys: Use for performance and SCD
Layers:
- •Bronze: Raw
- •Silver: Cleaned
- •Gold: Analytics-ready
Best Practices:
- •Define grain
- •Use surrogate keys
- •Denormalize dimensions
- •Keep facts narrow
- •Use conformed dimensions