Troubleshooting dbt Job Errors
Systematically diagnose and resolve dbt Cloud job failures using available MCP tools, CLI commands, and data investigation.
When to Use
- •dbt Cloud / dbt platform job failed and you need to find the root cause
- •Intermittent job failures that are hard to reproduce
- •Error messages that don't clearly indicate the problem
- •Post-merge failures where a recent change may have caused the issue
Not for: Local dbt development errors - use the skill using-dbt-for-analytics-engineering instead
The Iron Rule
Never modify a test to make it pass without understanding why it's failing.
A failing test is evidence of a problem. Changing the test to pass hides the problem. Investigate the root cause first.
Rationalizations That Mean STOP
| You're Thinking... | Reality |
|---|---|
| "Just make the test pass" | The test is telling you something is wrong. Investigate first. |
| "There's a board meeting in 2 hours" | Rushing to a fix without diagnosis creates bigger problems. |
| "We've already spent 2 days on this" | Sunk cost doesn't justify skipping proper diagnosis. |
| "I'll just update the accepted values" | Are the new values valid business data or bugs? Verify first. |
| "It's probably just a flaky test" | "Flaky" means there's an overall issue. Find it. We don't allow flaky tests to stay. |
Workflow
flowchart TD
A[Job failure reported] --> B{MCP Admin API available?}
B -->|yes| C[Use list_jobs_runs to get history]
B -->|no| D[Ask user for logs and run_results.json]
C --> E[Use get_job_run_error for details]
D --> F[Classify error type]
E --> F
F --> G{Error type?}
G -->|Infrastructure| H[Check warehouse, connections, timeouts]
G -->|Code/Compilation| I[Check git history for recent changes]
G -->|Data/Test Failure| J[Use discovering-data skill to investigate]
H --> K{Root cause found?}
I --> K
J --> K
K -->|yes| L[Create branch, implement fix]
K -->|no| M[Create findings document]
L --> N[Add test - prefer unit test]
N --> O[Create PR with explanation]
M --> P[Document what was checked and next steps]
Step 1: Gather Job Run Information
If dbt MCP Server Admin API Available
Use these tools first - they provide the most comprehensive data:
| Tool | Purpose |
|---|---|
list_jobs_runs | Get recent run history, identify patterns |
get_job_run_error | Get detailed error message and context |
# Example: Get recent runs for job 12345 list_jobs_runs(job_id=12345, limit=10) # Example: Get error details for specific run get_job_run_error(run_id=67890)
Without MCP Admin API
Ask the user to provide these artifacts:
- •Job run logs from dbt Cloud UI (Debug logs preferred)
- •
run_results.json- contains execution status for each node
To get the run_results.json, generate the artifact URL for the user:
https://<DBT_ENDPOINT>/api/v2/accounts/<ACCOUNT_ID>/runs/<RUN_ID>/artifacts/run_results.json?step=<STEP_NUMBER>
Where:
- •
<DBT_ENDPOINT>- The dbt Cloud endpoint. e.g- •
cloud.getdbt.comfor the US multi-tenant platform (there are other endpoints for other regions) - •
ACCOUNT_PREFIX.us1.dbt.comfor the cell-based platforms (there are different cell endpoints for different regions and cloud providers)
- •
- •
<ACCOUNT_ID>- The dbt Cloud account ID - •
<RUN_ID>- The failed job run ID - •
<STEP_NUMBER>- The step that failed (e.g., if step 4 failed, use?step=4)
Example request:
"I don't have access to the dbt MCP server. Could you provide:
- •The debug logs from dbt Cloud (Job Run → Logs → Download)
- •The run_results.json - open this URL and copy/paste or upload the contents:
https://cloud.getdbt.com/api/v2/accounts/12345/runs/67890/artifacts/run_results.json?step=4
Step 2: Classify the Error
| Error Type | Indicators | Primary Investigation |
|---|---|---|
| Infrastructure | Connection timeout, warehouse error, permissions | Check warehouse status, connection settings |
| Code/Compilation | Undefined macro, syntax error, parsing error | Check git history for recent changes, use LSP tools |
| Data/Test Failure | Test failed with N results, schema mismatch | Use discovering-data skill to query actual data |
Step 3: Investigate Root Cause
For Infrastructure Errors
- •Check job configuration (timeout settings, execution steps, etc.)
- •Look for concurrent jobs competing for resources
- •Check if failures correlate with time of day or data volume
For Code/Compilation Errors
- •
Check git history for recent changes:
If you're not in the dbt project directory, use the dbt MCP server to find the repository:
code# Get project details including repository URL and project subdirectory get_project_details(project_id=<project_id>)
The response includes:
- •
repository- The git repository URL - •
dbt_project_subdirectory- Optional subfolder where the dbt project lives (e.g.,dbt/,transform/analytics/)
Then either:
- •Query the repository directly using
ghCLI if it's on GitHub - •Clone to a temporary folder:
git clone <repo_url> /tmp/dbt-investigation
Important: If the project is in a subfolder, navigate to it after cloning:
bashcd /tmp/dbt-investigation/<project_subdirectory>
Once in the project directory:
bashgit log --oneline -20 git diff HEAD~5..HEAD -- models/ macros/
- •
- •
Use the CLI and LSP tools from the dbt MCP server or use the dbt CLI to check for errors:
If the dbt MCP server is available, use its tools:
code# CLI tools mcp__dbt_parse() # Check for parsing errors mcp__dbt_list_models() # With selectos and `+` for finding models dependencies mcp__dbt_compile(models="failing_model") # Check compilation # LSP tools mcp__dbt_get_column_lineage() # Check column lineage
Otherwise, use the dbt CLI directly:
bashdbt parse # Check for parsing errors dbt list --select +failing_model # Check for models upstream of the failing model dbt compile --select failing_model # Check compilation
- •
Search for the error pattern:
- •Find where the undefined macro/model should be defined
- •Check if a file was deleted or renamed
For Data/Test Failures
Use the discovering-data skill to investigate the actual data.
- •
Get the test SQL
bashdbt compile --select project_name.folder1.folder2.test_unique_name --output json
the full path for the test can be found with a
dbt ls --resource-type testcommand - •
Query the failing test's underlying data:
bashdbt show --inline "<query_from_the_test_SQL>" --output json
- •
Compare to recent git changes:
- •Did a transformation change introduce new values?
- •Did upstream source data change?
Step 4: Resolution
If Root Cause Is Found
- •
Create a new branch:
bashgit checkout -b fix/job-failure-<description>
- •
Implement the fix addressing the actual root cause
- •
Add a test to prevent recurrence:
- •Prefer unit tests for logic issues
- •Use data tests for data quality issues
- •Example unit test for transformation logic:
yamlunit_tests: - name: test_status_mapping model: orders given: - input: ref('stg_orders') rows: - {status_code: 1, expected_status: 'pending'} - {status_code: 2, expected_status: 'shipped'} expect: rows: - {status: 'pending'} - {status: 'shipped'} - •
Create a PR with:
- •Description of the issue
- •Root cause analysis
- •How the fix resolves it
- •Test coverage added
If Root Cause Is NOT Found
Do not guess. Create a findings document.
Create docs/investigations/job-failure-<date>.md:
# Job Failure Investigation: <Job Name> **Date:** YYYY-MM-DD **Job ID:** <id> **Status:** Unresolved ## Summary Brief description of the failure and symptoms. ## What Was Checked ### Tools Used - [ ] list_jobs_runs - findings - [ ] get_job_run_error - findings - [ ] git history - findings - [ ] Data investigation - findings ### Hypotheses Tested | Hypothesis | Evidence | Result | |------------|----------|--------| | Recent code change | No changes to affected models in 7 days | Ruled out | ## Patterns Observed - Failures occur between 2-4 AM (peak load time?) - Always fails on model X ## Suggested Next Steps 1. [ ] Check the data ingestion process to see if new data was added 2. [ ] Check if a new version of dbt or of the dbt adapter was released ## Related Resources - Link to job run logs - Link to relevant documentation
Commit this document to the repository so findings aren't lost.
Quick Reference
| Task | Tool/Command |
|---|---|
| Get job run history | list_jobs_runs (MCP) |
| Get detailed error | get_job_run_error (MCP) |
| Check recent git changes | git log --oneline -20 |
| Parse project | dbt parse |
| Compile specific model | dbt compile --select model_name |
| Query data | dbt show --inline "SELECT ..." --output json |
| Run specific test | dbt test --select test_name |
Common Mistakes
Modifying tests to pass without investigation
- •A failing test is a signal, not an obstacle. Understand WHY before changing anything.
Skipping git history review
- •Most failures correlate with recent changes. Always check what changed.
Not documenting when unresolved
- •"I couldn't figure it out" leaves no trail. Document what was checked and what remains.
Making best-guess fixes under pressure
- •A wrong fix creates more problems. Take time to diagnose properly.
Ignoring data investigation for test failures
- •Test failures often reveal data issues. Query the actual data before assuming code is wrong.