dbt Expert Skill

When to Use This Skill

Activate this skill when:

•Writing dbt models (staging, intermediate, mart)
•Creating or updating schema.yml files
•Writing dbt tests (generic or custom)
•Implementing incremental models
•Optimizing dbt performance
•Debugging dbt errors
•Setting up dbt projects

Core Patterns

1. Model Layering (MANDATORY)

code

raw → staging → intermediate → mart
 ↓        ↓           ↓          ↓
API    Clean      Business    Final
data   + cast      logic     product

Staging Models (stg_):

sql

-- models/staging/youtube/stg_youtube__videos.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'youtube_videos') }}
),

renamed AS (
    SELECT
        -- Keys
        video_id,
        channel_id AS creator_id,
        
        -- Dimensions
        title,
        description,
        
        -- Metrics (cast to proper types)
        CAST(view_count AS INT64) AS views,
        CAST(like_count AS INT64) AS likes,
        CAST(comment_count AS INT64) AS comments,
        
        -- Timestamps
        PARSE_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', published_at) AS published_at,
        
        -- Metadata
        'youtube' AS platform,
        _loaded_at AS source_loaded_at
    FROM source
)

SELECT * FROM renamed

Key principles:

•1:1 with source tables
•Only renaming, type casting, and simple calculations
•No joins, no aggregations, no business logic
•Add platform identifier for cross-platform models

Intermediate Models (int_):

sql

-- models/intermediate/int_creator_video_metrics.sql
WITH video_stats AS (
    SELECT * FROM {{ ref('stg_youtube__videos') }}
),

creator_info AS (
    SELECT * FROM {{ ref('stg_youtube__creators') }}
),

joined AS (
    SELECT
        v.video_id,
        v.creator_id,
        c.creator_name,
        c.subscriber_count,
        
        -- Calculated metrics
        v.views,
        v.likes,
        v.comments,
        SAFE_DIVIDE(v.likes, v.views) AS like_rate,
        SAFE_DIVIDE(v.comments, v.views) AS comment_rate,
        (v.likes + v.comments * 2) / v.views AS engagement_rate,
        
        v.published_at,
        v.platform
    FROM video_stats v
    INNER JOIN creator_info c
        ON v.creator_id = c.creator_id
)

SELECT * FROM joined

Key principles:

•Joins between staging models
•Business logic and calculations
•Can be ephemeral or table materialized
•Break complex logic into multiple intermediate steps

Mart Models (fact_, dim_):

sql

-- models/mart/fact_video_performance.sql
{{
    config(
        materialized='table',
        partition_by={
            "field": "published_date",
            "data_type": "date"
        },
        cluster_by=["creator_id", "platform"]
    )
}}

WITH video_metrics AS (
    SELECT * FROM {{ ref('int_creator_video_metrics') }}
),

final AS (
    SELECT
        -- Keys
        video_id,
        creator_id,
        
        -- Dimensions
        creator_name,
        platform,
        DATE(published_at) AS published_date,
        
        -- Metrics
        views,
        likes,
        comments,
        like_rate,
        comment_rate,
        engagement_rate,
        
        -- Metadata
        published_at,
        CURRENT_TIMESTAMP() AS dbt_updated_at
    FROM video_metrics
)

SELECT * FROM final

Key principles:

•Final tables for consumption (APIs, dashboards, analysts)
•Proper materialization strategy
•Clear grain (one row = ?)
•All columns documented in schema.yml

2. Incremental Models Pattern

sql

-- models/mart/fact_daily_creator_stats.sql
{{
    config(
        materialized='incremental',
        partition_by={
            "field": "stat_date",
            "data_type": "date"
        },
        cluster_by=["creator_id", "platform"],
        unique_key='creator_stat_key',
        on_schema_change='fail'
    )
}}

WITH daily_stats AS (
    SELECT
        -- Create unique key
        {{ dbt_utils.generate_surrogate_key(['creator_id', 'stat_date', 'platform']) }} AS creator_stat_key,
        
        creator_id,
        DATE(event_timestamp) AS stat_date,
        platform,
        
        SUM(views) AS total_views,
        SUM(revenue_usd) AS total_revenue,
        COUNT(DISTINCT video_id) AS videos_published
        
    FROM {{ ref('stg_platform_events') }}
    WHERE 1=1
        AND event_timestamp >= '2024-01-01'  -- Performance: limit historical scan
        
        {% if is_incremental() %}
            -- Only process new/updated data
            AND DATE(event_timestamp) > (
                SELECT MAX(stat_date) 
                FROM {{ this }}
            )
        {% endif %}
        
    GROUP BY 1, 2, 3, 4
)

SELECT * FROM daily_stats

Key patterns:

•unique_key: Required for deduplication
•is_incremental(): Only process new data on incremental runs
•on_schema_change: 'fail' prevents accidental schema changes
•Partition + cluster: For query performance
•Watermark column: Use MAX(date) for incremental logic

When to use incremental:

•✅ Large fact tables (>1M rows)
•✅ Append-only data
•✅ Clear watermark column
•❌ Small dimension tables
•❌ Data can be updated/deleted
•❌ Complex backfill requirements

3. Testing Strategy

yaml

# models/staging/youtube/stg_youtube__videos.yml
version: 2

models:
  - name: stg_youtube__videos
    description: Cleaned YouTube video data from raw API ingestion
    
    tests:
      - dbt_utils.expression_is_true:
          expression: "views >= 0"
          config:
            severity: error
    
    columns:
      - name: video_id
        description: Unique YouTube video identifier
        tests:
          - not_null
          - unique
        
      - name: creator_id
        description: YouTube channel ID of video creator
        tests:
          - not_null
          - relationships:
              to: ref('stg_youtube__creators')
              field: creator_id
              config:
                severity: warn  # Warning if creator not found
        
      - name: views
        description: Number of video views
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              inclusive: true
        
      - name: published_at
        description: Video publication timestamp
        tests:
          - not_null
          - dbt_utils.not_null_proportion:
              at_least: 0.99
        
      - name: platform
        description: Platform identifier (always 'youtube' for this model)
        tests:
          - not_null
          - accepted_values:
              values: ['youtube']

Test categories:

•Primary Key Tests (REQUIRED):

yaml

- not_null
- unique

•Foreign Key Tests (REQUIRED):

yaml

- relationships:
    to: ref('parent_table')
    field: parent_key

•Data Quality Tests:

yaml

- accepted_values:
    values: ['value1', 'value2']
- dbt_utils.accepted_range:
    min_value: 0
    max_value: 100

•Custom Tests:

sql

-- tests/assert_revenue_not_negative.sql
SELECT
    conversion_id,
    revenue_usd
FROM {{ ref('fact_conversions') }}
WHERE revenue_usd < 0

4. Sources Configuration

yaml

# models/staging/sources.yml
version: 2

sources:
  - name: raw
    description: Raw data ingested from platform APIs
    database: sponsorgraph-prod
    schema: raw
    
    tables:
      - name: youtube_videos
        description: Raw YouTube video data from YouTube Data API v3
        
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
        
        columns:
          - name: video_id
            description: YouTube video ID
            tests:
              - not_null
              - unique
          
          - name: _loaded_at
            description: Timestamp when data was loaded to BigQuery
            tests:
              - not_null

Source freshness checks:

bash

# Check if data is fresh
dbt source freshness

# Output:
# 15:32:05  1 of 1 START freshness of raw.youtube_videos ..................... [RUN]
# 15:32:06  1 of 1 WARN freshness of raw.youtube_videos ...................... [WARN in 0.84s]

5. Macros for Reusability

sql

-- macros/calculate_engagement_rate.sql
{% macro calculate_engagement_rate(likes, comments, shares, views) %}
    SAFE_DIVIDE(
        {{ likes }} + ({{ comments }} * 2) + ({{ shares }} * 3),
        {{ views }}
    )
{% endmacro %}

Usage:

sql

SELECT
    video_id,
    {{ calculate_engagement_rate('likes', 'comments', 'shares', 'views') }} AS engagement_rate
FROM {{ ref('stg_youtube__videos') }}

Common macros:

sql

-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {%- if custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}

-- macros/grant_select.sql
{% macro grant_select(schema, role) %}
    GRANT SELECT ON ALL TABLES IN SCHEMA {{ schema }} TO {{ role }};
{% endmacro %}

6. dbt Project Configuration

yaml

# dbt_project.yml
name: 'sponsorgraph'
version: '1.0.0'
config-version: 2

profile: 'sponsorgraph'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

models:
  sponsorgraph:
    # Staging models
    staging:
      +materialized: view
      +schema: staging
      
    # Intermediate models (ephemeral = not materialized)
    intermediate:
      +materialized: ephemeral
      
    # Mart models
    mart:
      +materialized: table
      +schema: mart
      
      # Fact tables use incremental
      fact:
        +materialized: incremental
        +on_schema_change: fail

seeds:
  sponsorgraph:
    +schema: seed
    +quote_columns: false

tests:
  sponsorgraph:
    +severity: error  # Default: fail on test failure

7. Performance Optimization

Use CTEs, not subqueries:

sql

-- Good
WITH base AS (
    SELECT * FROM {{ ref('large_table') }}
    WHERE date >= '2024-01-01'
),
aggregated AS (
    SELECT
        creator_id,
        SUM(revenue) AS total_revenue
    FROM base
    GROUP BY 1
)
SELECT * FROM aggregated;

-- Bad
SELECT
    creator_id,
    SUM(revenue) AS total_revenue
FROM (
    SELECT * FROM {{ ref('large_table') }}
    WHERE date >= '2024-01-01'
) subquery
GROUP BY 1;

Use ref() for dependencies:

sql

-- Good: dbt tracks dependencies
SELECT * FROM {{ ref('stg_youtube__videos') }}

-- Bad: breaks dependency graph
SELECT * FROM `project.staging.stg_youtube__videos`

Limit expensive operations:

sql

-- Good: Filter early
WITH filtered AS (
    SELECT * FROM {{ ref('fact_events') }}
    WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)  -- Partition filter
)
SELECT
    creator_id,
    COUNT(*) AS event_count
FROM filtered
GROUP BY 1;

-- Bad: Filter after aggregation
SELECT
    creator_id,
    COUNT(*) AS event_count
FROM {{ ref('fact_events') }}
GROUP BY 1
HAVING event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

Common Issues & Solutions

Issue: "Ambiguous column name"

Cause: Column exists in multiple CTEs

Solution: Prefix with CTE name

sql

WITH cte1 AS (
    SELECT creator_id, name FROM table1
),
cte2 AS (
    SELECT creator_id, platform FROM table2
)
SELECT
    cte1.creator_id,  -- Explicitly specify
    cte1.name,
    cte2.platform
FROM cte1
JOIN cte2 ON cte1.creator_id = cte2.creator_id;

Issue: "Incremental model not updating"

Cause: is_incremental() condition too restrictive

Solution: Check watermark logic

sql

{% if is_incremental() %}
    -- Make sure this actually matches new data
    AND updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Issue: "Compilation Error: depends on a node named 'X' which was not found"

Cause: Typo in ref() or source()

Solution: Check spelling

sql

-- Wrong
SELECT * FROM {{ ref('stg_youtube__video') }}  -- Missing 's'

-- Right
SELECT * FROM {{ ref('stg_youtube__videos') }}

Issue: "Resource 'X' not found"

Cause: Wrong schema or database

Solution: Check sources.yml configuration

yaml

sources:
  - name: raw
    database: sponsorgraph-prod  # Must match actual database
    schema: raw                  # Must match actual schema

dbt Commands Reference

bash

# Development workflow
dbt debug                          # Test connection
dbt deps                           # Install packages
dbt seed                           # Load seed files
dbt run                            # Run all models
dbt run --select stg_youtube__*    # Run specific models
dbt run --full-refresh             # Force full refresh (ignore incremental)
dbt test                           # Run all tests
dbt test --select stg_youtube__videos  # Test specific model
dbt docs generate                  # Generate documentation
dbt docs serve                     # Serve docs locally

# Advanced selectors
dbt run --select tag:daily         # Run models with tag
dbt run --select +fact_conversions # Run model + upstream dependencies
dbt run --select fact_conversions+ # Run model + downstream dependencies
dbt run --select @fact_conversions # Run model + all dependencies
dbt run --exclude tag:deprecated   # Exclude models

# Target environments
dbt run --target dev               # Run against dev
dbt run --target prod              # Run against prod

# Monitoring
dbt source freshness               # Check source freshness
dbt run-operation grant_select --args '{schema: mart, role: analyst}'  # Run macro

Documentation Best Practices

Model Documentation

yaml

models:
  - name: fact_video_performance
    description: |
      Video performance metrics aggregated daily by creator and platform.
      
      **Grain**: One row per video
      
      **Refresh**: Incremental (daily)
      
      **SLA**: Updated by 2 AM daily
      
      **Owners**: Data Team (@data-team)
    
    meta:
      owner: "@data-team"
      contains_pii: false
      
    config:
      tags: ['daily', 'mart', 'revenue']

Column Documentation

yaml

columns:
  - name: engagement_rate
    description: |
      Calculated engagement rate using weighted formula:
      (likes + comments*2 + shares*3) / views
      
      **Formula**: See macro `calculate_engagement_rate`
      
      **Range**: 0.0 to 1.0 (0% to 100%)
      
      **Null handling**: Returns 0 if views = 0
    
    meta:
      metric_type: "rate"
      aggregation: "average"

Testing Checklist

Before merging dbt changes:

Performance Benchmarks

Target performance (adjust based on data volume):

•Staging models: <30 seconds each
•Intermediate models: <2 minutes each
•Mart models: <5 minutes each
•Full dbt run: <15 minutes total
•Incremental run: <5 minutes total

If slower:

•Check partition filters
•Review join strategies
•Consider materializing intermediate steps
•Profile with dbt run --profiles-dir . --profile-limit 10

Related Resources

•Main context: .claude/CLAUDE.md
•BigQuery skill: .claude/skills/bigquery/SKILL.md
•dbt docs: https://docs.getdbt.com
•dbt utils: https://github.com/dbt-labs/dbt-utils