dbt Patterns

What is dbt?

dbt (data build tool): SQL-first transformation tool that enables analytics engineers to transform data in the warehouse using SELECT statements.

Core Concept

code

Raw Data (Extract & Load) → dbt (Transform) → Analytics-Ready Data

Traditional ETL: Extract → Transform → Load
Modern ELT: Extract → Load → Transform (with dbt)

Why dbt?

•Version control: SQL as code (Git)
•Testing: Built-in data quality tests
•Documentation: Auto-generated docs
•Modularity: Reusable models
•Lineage: Visual data lineage
•Collaboration: Team workflows

dbt Project Structure

Standard Layout

code

my_dbt_project/
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Connection profiles
├── models/                  # SQL models
│   ├── staging/            # Raw data cleaning
│   ├── intermediate/       # Business logic
│   └── marts/              # Final analytics tables
├── tests/                   # Custom tests
├── macros/                  # Reusable SQL
├── seeds/                   # CSV reference data
├── snapshots/               # SCD Type 2
└── analyses/                # Ad-hoc queries

dbt_project.yml

yaml

name: 'my_project'
version: '1.0.0'
config-version: 2

profile: 'my_profile'

model-paths: ["models"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

models:
  my_project:
    staging:
      +materialized: view
      +schema: staging
    intermediate:
      +materialized: ephemeral
    marts:
      +materialized: table
      +schema: analytics

Model Layers (Staging → Intermediate → Marts)

Staging Layer

Purpose: Clean and standardize raw data

Pattern:

sql

-- models/staging/stg_orders.sql
with source as (
    select * from {{ source('raw', 'orders') }}
),

renamed as (
    select
        id as order_id,
        user_id,
        created_at as order_created_at,
        status as order_status,
        total_amount
    from source
)

select * from renamed

Best Practices:

•One staging model per source table
•Rename columns to consistent naming
•Cast data types
•No business logic
•Materialized as views (lightweight)

Intermediate Layer

Purpose: Business logic and transformations

Pattern:

sql

-- models/intermediate/int_orders_with_customer.sql
with orders as (
    select * from {{ ref('stg_orders') }}
),

customers as (
    select * from {{ ref('stg_customers') }}
),

joined as (
    select
        orders.*,
        customers.customer_name,
        customers.customer_segment
    from orders
    left join customers
        on orders.user_id = customers.customer_id
)

select * from joined

Best Practices:

•Complex joins
•Business logic
•Calculations
•Materialized as ephemeral (not persisted)

Marts Layer

Purpose: Final analytics-ready tables

Pattern:

sql

-- models/marts/fct_orders.sql
with orders as (
    select * from {{ ref('int_orders_with_customer') }}
),

aggregated as (
    select
        order_id,
        user_id,
        customer_name,
        order_created_at,
        order_status,
        total_amount,
        case
            when order_status = 'completed' then total_amount
            else 0
        end as completed_revenue
    from orders
)

select * from aggregated

Best Practices:

•Business-friendly naming
•Denormalized for analytics
•Materialized as tables (fast queries)
•Documented

Naming Conventions

Model Naming

code

Staging:      stg_<source>_<table>
              stg_salesforce_accounts
              stg_stripe_payments

Intermediate: int_<entity>_<verb>
              int_orders_joined
              int_customers_enriched

Facts:        fct_<entity>
              fct_orders
              fct_revenue

Dimensions:   dim_<entity>
              dim_customers
              dim_products

Column Naming

code

IDs:          <entity>_id
              customer_id, order_id

Dates:        <entity>_<verb>_at
              order_created_at, customer_updated_at

Booleans:     is_<condition> or has_<condition>
              is_active, has_subscription

Amounts:      <entity>_amount
              order_amount, refund_amount

Materializations

View

sql

{{ config(materialized='view') }}

select * from {{ ref('stg_orders') }}

•Pros: Always fresh, no storage
•Cons: Slow queries (recomputed each time)
•Use for: Staging models, rarely queried

Table

sql

{{ config(materialized='table') }}

select * from {{ ref('fct_orders') }}

•Pros: Fast queries
•Cons: Storage cost, stale data
•Use for: Marts, frequently queried

Incremental

sql

{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select * from {{ ref('stg_orders') }}

{% if is_incremental() %}
    where order_created_at > (select max(order_created_at) from {{ this }})
{% endif %}

•Pros: Fast builds, handles large data
•Cons: Complex logic
•Use for: Large fact tables

Ephemeral

sql

{{ config(materialized='ephemeral') }}

select * from {{ ref('stg_orders') }}

•Pros: No storage, CTE in downstream models
•Cons: Recomputed in each downstream model
•Use for: Intermediate models

Testing

Schema Tests

yaml

# models/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order facts table"
    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - unique
          - not_null
      
      - name: user_id
        description: "Customer who placed order"
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'completed', 'cancelled']
      
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Custom Tests

sql

-- tests/assert_positive_revenue.sql
select *
from {{ ref('fct_orders') }}
where total_amount < 0

Test Types

•unique: No duplicates
•not_null: No nulls
•accepted_values: Value in list
•relationships: Foreign key check
•custom: SQL query returns 0 rows

Macros

Reusable SQL

sql

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ({{ column_name }} / 100.0)::decimal(10,2)
{% endmacro %}

Usage:

sql

select
    order_id,
    {{ cents_to_dollars('total_amount_cents') }} as total_amount_dollars
from {{ ref('stg_orders') }}

Common Macros

sql

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) %}
    {%- if custom_schema_name is none -%}
        {{ target.schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{% endmacro %}

Sources

Defining Sources

yaml

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    database: analytics_db
    schema: raw_data
    tables:
      - name: orders
        description: "Raw orders from production DB"
        columns:
          - name: id
            tests:
              - unique
              - not_null
        
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        
        loaded_at_field: _loaded_at

Using Sources

sql

select * from {{ source('raw', 'orders') }}

Source Freshness

bash

dbt source freshness

Documentation

Model Documentation

yaml

# models/schema.yml
version: 2

models:
  - name: fct_orders
    description: |
      Order facts table containing all orders with customer information.
      
      This table is updated daily at 2 AM UTC.
      
      **Business Rules:**
      - Only includes orders with status 'completed' or 'pending'
      - Cancelled orders are excluded
      
    columns:
      - name: order_id
        description: "Unique identifier for each order"
      
      - name: total_amount
        description: "Total order amount in dollars"

Generate Docs

bash

dbt docs generate
dbt docs serve

Output: Interactive documentation website with:

•Model descriptions
•Column descriptions
•Data lineage (DAG)
•Source freshness

Snapshots (SCD Type 2)

Snapshot Configuration

sql

-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='customer_id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}

Run Snapshots

bash

dbt snapshot

Output

code

customer_id | name  | updated_at | dbt_valid_from | dbt_valid_to | dbt_scd_id
1           | John  | 2024-01-01 | 2024-01-01     | 2024-01-15   | abc123
1           | John  | 2024-01-15 | 2024-01-15     | null         | def456

Incremental Models

Basic Incremental

sql

{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    user_id,
    order_created_at,
    total_amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
    -- Only process new/updated records
    where order_created_at > (select max(order_created_at) from {{ this }})
{% endif %}

Incremental with Delete+Insert

sql

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='delete+insert'
) }}

Incremental with Merge

sql

{{ config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    merge_update_columns=['order_status', 'total_amount']
) }}

Packages

Installing Packages

yaml

# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.0
  
  - package: calogica/dbt_expectations
    version: 0.9.0
  
  - package: dbt-labs/codegen
    version: 0.11.0

bash

dbt deps

Using Packages

sql

-- dbt_utils
select
    {{ dbt_utils.surrogate_key(['order_id', 'user_id']) }} as unique_key,
    *
from {{ ref('stg_orders') }}

-- dbt_expectations
tests:
  - dbt_expectations.expect_column_values_to_be_between:
      min_value: 0
      max_value: 1000000

Production Workflows

Development Workflow

bash

# 1. Create feature branch
git checkout -b feature/new-model

# 2. Develop model
# Edit models/marts/fct_new_model.sql

# 3. Run model
dbt run --select fct_new_model

# 4. Test model
dbt test --select fct_new_model

# 5. Document model
# Edit models/schema.yml

# 6. Commit and push
git add .
git commit -m "Add new model"
git push origin feature/new-model

# 7. Create PR
# Review, approve, merge

CI/CD Pipeline

yaml

# .github/workflows/dbt_ci.yml
name: dbt CI

on: [pull_request]

jobs:
  dbt_run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      
      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'
      
      - name: Install dbt
        run: pip install dbt-snowflake
      
      - name: Run dbt
        run: |
          dbt deps
          dbt run --select state:modified+ --defer --state ./prod_manifest
          dbt test --select state:modified+ --defer --state ./prod_manifest

Production Deployment

bash

# Daily production run
dbt run --target prod
dbt test --target prod
dbt source freshness --target prod

Performance Optimization

Optimize Queries

sql

-- Bad: Multiple CTEs with same source
with orders_1 as (
    select * from {{ ref('stg_orders') }}
    where status = 'completed'
),
orders_2 as (
    select * from {{ ref('stg_orders') }}
    where status = 'pending'
)

-- Good: Single CTE, filter later
with orders as (
    select * from {{ ref('stg_orders') }}
    where status in ('completed', 'pending')
),
completed_orders as (
    select * from orders where status = 'completed'
),
pending_orders as (
    select * from orders where status = 'pending'
)

Use Incremental Models

sql

-- For large tables (millions of rows)
{{ config(materialized='incremental') }}

Partition Tables

sql

{{ config(
    materialized='table',
    partition_by={
        "field": "order_date",
        "data_type": "date",
        "granularity": "day"
    }
) }}

Best Practices

1. One Model Per File

code

✓ models/staging/stg_orders.sql
✗ models/staging/all_staging_models.sql

2. Use CTEs, Not Subqueries

sql

-- Good
with orders as (
    select * from {{ ref('stg_orders') }}
)
select * from orders

-- Bad
select * from (
    select * from {{ ref('stg_orders') }}
) as orders

3. Explicit Column Selection

sql

-- Good
select
    order_id,
    user_id,
    total_amount
from {{ ref('stg_orders') }}

-- Bad
select * from {{ ref('stg_orders') }}

4. Test Everything

yaml

# Every model should have tests
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests: [unique, not_null]

5. Document Everything

yaml

# Every model and column should have description
models:
  - name: fct_orders
    description: "Order facts table"
    columns:
      - name: order_id
        description: "Unique order ID"

Common Patterns

Slowly Changing Dimensions (SCD)

sql

-- Type 1: Overwrite
{{ config(materialized='table') }}
select * from {{ ref('stg_customers') }}

-- Type 2: Historical tracking
{% snapshot customers_snapshot %}
{{ config(strategy='timestamp', updated_at='updated_at') }}
select * from {{ ref('stg_customers') }}
{% endsnapshot %}

Fact Tables

sql

-- Transactional facts
{{ config(materialized='incremental', unique_key='order_id') }}
select
    order_id,
    customer_id,
    product_id,
    order_date,
    quantity,
    amount
from {{ ref('stg_orders') }}

Dimension Tables

sql

-- Dimension table
{{ config(materialized='table') }}
select
    customer_id,
    customer_name,
    customer_segment,
    customer_region
from {{ ref('stg_customers') }}

Troubleshooting

Common Errors

Compilation Error:

code

Compilation Error in model fct_orders
  Model 'stg_orders' not found

Fix: Check model name, ensure it exists

Test Failure:

code

Failure in test unique_fct_orders_order_id
  Got 5 results, expected 0

Fix: Investigate duplicate order_ids

Freshness Error:

code

Source 'raw.orders' is stale (loaded 25 hours ago)

Fix: Check ETL pipeline, data loading

Summary

dbt: SQL-first transformation tool

Layers:

•Staging: Clean raw data
•Intermediate: Business logic
•Marts: Analytics-ready

Materializations:

•View: Always fresh, slow
•Table: Fast, stale
•Incremental: Large data
•Ephemeral: No storage

Testing:

•unique, not_null, accepted_values, relationships
•Custom SQL tests

Best Practices:

•One model per file
•Use CTEs
•Test everything
•Document everything
•Version control

Workflow:

•Develop → Test → Document → PR → Merge → Deploy