Databricks Autonomous Operations

1. Overview

This skill is both an SDK/CLI/Connect reference and an autonomous operations playbook. It teaches the AI agent to operate as an SRE — independently deploying, monitoring, diagnosing failures, applying fixes, redeploying, and verifying results across all Databricks resource types.

Core Loop: Deploy → Poll → Diagnose → Fix → Redeploy → Verify (max 3 iterations before escalation to user).

When to Activate This Skill

•Deploying or running Databricks Asset Bundles (bundle deploy, bundle run)
•Monitoring job or pipeline runs for completion
•Troubleshooting ANY failure: jobs, DLT pipelines, monitors, alerts, clusters, Genie Spaces
•Using the Databricks Python SDK, CLI, Connect, or REST API
•Encountering error messages from Databricks services
•Operating in a self-healing deploy-fix-redeploy cycle
•Checking job/task/pipeline status or retrieving run output

SDK Docs: https://databricks-sdk-py.readthedocs.io/en/latest/ GitHub: https://github.com/databricks/databricks-sdk-py

Runnable Examples: See examples/ directory for complete, copy-pasteable scripts:

•examples/1-authentication.py — All auth patterns (env vars, profiles, Azure SP, AccountClient, notebook context)
•examples/2-clusters-and-jobs.py — Cluster auto-selection, autoscaling, job CRUD, submit_and_wait, run_now_and_wait
•examples/3-sql-and-warehouses.py — Parameterized queries, chunked results, query_to_dataframe helper
•examples/4-unity-catalog.py — Tables, schemas, catalogs, volumes, file operations, pattern matching
•examples/5-serving-and-vector-search.py — Endpoint creation with TrafficConfig, chat/embedding queries, vector search with filters
•examples/6-autonomous-operations.py — Job monitoring, multi-task output retrieval, failure diagnosis, self-healing loop

2. Environment & Authentication

Setup

•SDK: uv pip install databricks-sdk; Connect: uv pip install databricks-connect
•CLI version: >= 0.278.0 (databricks --version)
•Config: ~/.databrickscfg or env vars DATABRICKS_HOST, DATABRICKS_TOKEN

Quick Auth

python

from databricks.sdk import WorkspaceClient
w = WorkspaceClient()                    # Auto-detect (env, config file, or notebook)
w = WorkspaceClient(profile="MY_PROFILE") # Named profile

•Token expired? databricks auth login --host <url> --profile <name>
•Profile-based CLI: DATABRICKS_CONFIG_PROFILE=<name> databricks <command>
•Full auth patterns (Azure SP, AccountClient, etc.): see references/sdk-api-reference.md

3. SDK API Quick Reference

Full code examples and patterns: references/sdk-api-reference.md Runnable scripts: examples/ directory (see file listing in Section 1)

API	Key Operations	Critical Notes
Clusters	`list`, `get`, `create_and_wait`, `ensure_cluster_is_running`, `start`, `stop`	Use `select_spark_version()` and `select_node_type()`
Jobs	`list`, `run_now_and_wait`, `submit_and_wait`, `get_run`, `get_run_output`, `cancel_run`	`get_run_output` needs task `run_id`, not parent job `run_id`
SQL Execution	`execute_statement` with `wait_timeout`	Use `StatementParameterListItem` for parameterized queries
Warehouses	`list`, `start`, `stop`, `create_and_wait`	Use `auto_stop_mins` to control costs
Unity Catalog	`tables.list`, `tables.get`, `tables.exists`, `catalogs.list`, `schemas.list`	`list_summaries` for fast pattern matching
Volumes/Files	`files.upload`, `files.download`, `files.list_directory_contents`	Path format: `/Volumes/catalog/schema/volume/file`
Serving	`create_and_wait`, `query` (custom/chat/embeddings), `get_open_ai_client`	`scale_to_zero_enabled` for cost control
Vector Search	`query_index`, `upsert_data_vector_index`, `sync_index`	Use `filters_json` for filtered queries
Pipelines	`list_pipelines`, `get`, `start_update`, `stop_and_wait`, `list_pipeline_events`	Events contain error details for troubleshooting
Monitors	`quality_monitors.get`, `delete`, `update`, `run_refresh`	Update MUST include ALL `custom_metrics`
Alerts V2	`alerts_v2.create_alert`, `update_alert`, `delete_alert`	Update requires full `update_mask`
Secrets	`create_scope`, `put_secret`, `get_secret`	Use for API keys, tokens

Critical Patterns

•Async apps (FastAPI): SDK is synchronous — wrap with asyncio.to_thread()
•Long-running ops: Use _and_wait() variants with timeout=timedelta(...)
•Error handling: from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists
•Direct REST: w.api_client.do(method="GET", path="/api/2.0/...")

4. CLI Operations Reference

Command Matrix

Category	Command	Purpose
Bundle	`databricks bundle validate`	Pre-deploy validation (catches ~80% of errors)
Bundle	`databricks bundle deploy -t <target>`	Deploy all resources
Bundle	`databricks bundle run -t <target> <job>`	Start a job run
Bundle	`databricks bundle destroy -t <target>`	Cleanup all resources
Jobs	`databricks jobs get-run <RUN_ID> --output json`	Get run status
Jobs	`databricks jobs get-run-output <TASK_RUN_ID> --output json`	Get task notebook output
Jobs	`databricks jobs list-runs --job-id <JOB_ID> --output json`	List recent runs
Jobs	`databricks jobs cancel-run <RUN_ID>`	Cancel running job
Pipelines	`databricks pipelines get <PIPELINE_ID> --output json`	Get pipeline status
Pipelines	`databricks pipelines get-update <PID> <UID> --output json`	Get update details
Pipelines	`databricks pipelines list-pipeline-events <PID> --output json`	Get events/errors
Pipelines	`databricks pipelines start-update <PIPELINE_ID>`	Start pipeline update
Clusters	`databricks clusters get <CLUSTER_ID> --output json`	Get cluster status
Clusters	`databricks clusters events <CLUSTER_ID> --output json`	Get cluster events
Warehouses	`databricks warehouses get <WH_ID> --output json`	Get warehouse status
Auth	`databricks auth login --host <url> --profile <name>`	Re-authenticate
Workspace	`databricks workspace export <path> --format SOURCE`	Export notebook
Apps	`databricks apps logs <app-name>`	App deployment/runtime logs

Key jq Patterns

See references/cli-jq-patterns.md for the complete catalog. Most critical:

bash

# Job state
databricks jobs get-run <RUN_ID> --output json | jq '.state'

# Task summary for multi-task jobs
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | {task: .task_key, run_id: .run_id, result: .state.result_state}'

# Failed tasks only
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message, url: .run_page_url}'

# Task output (MUST use task run_id, not parent job run_id)
databricks jobs get-run-output <TASK_RUN_ID> --output json | jq -r '.notebook_output.result // "No output"'

5. Autonomous Deploy-Run-Fix Playbook

This is the core workflow. Follow these steps whenever deploying a bundle, running a job, or operating autonomously.

Step 0: Discover Bundle Resources

Before deploying, understand what's in the bundle:

bash

# Read databricks.yml to find targets and included resources
# Then read resource YAML files under resources/ directory
# Identify: jobs (atomic/composite/orchestrator), pipelines, dashboards, alerts

Determine the target (dev, staging, prod) from the user or default to dev.

Step 1: Validate & Deploy

bash

databricks bundle validate -t <target>   # Pre-flight — catches ~80% of errors
databricks bundle deploy -t <target>     # Deploy all resources

If validate fails: Read the error, fix the YAML (see Section 6 + common/databricks-asset-bundles skill), re-validate. If deploy fails: Common causes: auth expired (403), path resolution errors, invalid task types. See Section 6.

Step 2: Run

For jobs:

bash

databricks bundle run -t <target> <job_name>
# Output includes a Run URL like: https://<host>/#job/<JOB_ID>/run/<RUN_ID>
# EXTRACT the RUN_ID from this URL — you need it for polling

For DLT pipelines:

bash

databricks bundle run -t <target> <pipeline_name>
# Output includes: Pipeline URL with PIPELINE_ID
# Also shows UPDATE_ID for this specific update

Cursor-specific: Use block_until_ms: 0 to background bundle run immediately. Then read the terminal output file to capture the Run URL. Parse the RUN_ID or PIPELINE_ID from the URL.

Step 3: Poll Until Terminal State

Jobs — poll with exponential backoff (30s → 60s → 120s):

bash

databricks jobs get-run <RUN_ID> --output json | jq -r '.state.life_cycle_state'
# PENDING → RUNNING → TERMINATED (also: INTERNAL_ERROR, SKIPPED)

When life_cycle_state == TERMINATED, check the result:

bash

databricks jobs get-run <RUN_ID> --output json | jq -r '.state.result_state'
# SUCCESS → go to Step 4a    FAILED → go to Step 4b

DLT Pipelines — poll the latest update state:

bash

databricks pipelines get <PIPELINE_ID> --output json | jq -r '.latest_updates[0].state'
# COMPLETED → success    FAILED → get events for errors

Step 4a: On SUCCESS — Verify & Report

bash

# For multi-task jobs, verify all tasks succeeded:
databricks jobs get-run <RUN_ID> --output json \
  | jq '.tasks[] | {task: .task_key, result: .state.result_state}'

# Get task output (structured JSON from dbutils.notebook.exit):
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // "No output"'

Report success to user with the Run URL for verification. If this was part of a multi-job deployment, proceed to the next job.

Step 4b: On FAILURE — Diagnose

CRITICAL for multi-task jobs: get-run-output only works on the task run_id, NOT the parent job run_id.

bash

# Step A: Find failed tasks and their task-level run_ids
databricks jobs get-run <JOB_RUN_ID> --output json \
  | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, run_id: .run_id, error: .state.state_message}'

# Step B: Get detailed output for EACH failed task
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // .error // "No output"'

For DLT pipeline failures:

bash

databricks pipelines list-pipeline-events <PID> --output json \
  | jq '[.events[] | select(.level == "ERROR")] | .[0:5]'

Match the error against Section 6 (decision tree) and references/error-solution-matrix.md.

Step 5: Fix

•Read the source file(s) identified from the stack trace or error message
•Apply fix using editor tools (StrReplace, Write)
•If YAML/config issue → fix the DAB YAML (consult common/databricks-asset-bundles skill)
•If dependency issue → deploy upstream assets first (see Section 6 ordering)

Step 6: Redeploy & Re-run (Loop)

bash

databricks bundle deploy -t <target>    # Redeploy with fix
databricks bundle run -t <target> <job>  # Re-run
# → Return to Step 3 (Poll)

Maximum 3 iterations. Track each iteration's error and fix.

Step 7: Escalation (After 3 Failed Attempts)

Present to the user:

•All errors encountered — with run IDs, task keys, error messages
•All fixes attempted — what was changed and why
•Root cause hypothesis — best guess based on evidence
•Run page URLs — direct links to the Databricks UI

Step 8: Capture Learnings

After every resolved failure (or escalation), trigger admin/self-improvement:

•Add new error → fix mappings to references/error-solution-matrix.md
•Update "Top 10" table in Section 6 if this error is frequent
•Always capture if fix took 2+ iterations or was escalated

6. Master Troubleshooting Decision Tree

Full error-solution tables: references/error-solution-matrix.md (7 categories, 40+ patterns) DLT-specific troubleshooting: references/dlt-pipeline-troubleshooting.md Diagnostic SQL queries: references/diagnostic-queries.md

Quick Decision Guide

•Deployment error? → Check auth (403), YAML syntax, task type (notebook_task not python_task), path resolution
•Job runtime error? → Get task-level output (use task run_id), match error to solution matrix
•DLT pipeline error? → Get events via CLI, check upstream tables exist, verify schema
•Monitor/Alert error? → Check ResourceAlreadyExists (delete+recreate), schema existence, update_mask
•Dependency error? → Follow required order: Bronze → Gold Setup → Gold Merge → Semantic → Monitoring → Genie
•Cluster error? → Check events via CLI, look for OOM/unreachable/timeout patterns

Top 10 Most Common Errors (Inline)

Error	Quick Fix
`ModuleNotFoundError`	Add to `%pip install` or DAB environment spec
`TABLE_OR_VIEW_NOT_FOUND`	Run setup job first; check 3-part catalog.schema.table path
`DELTA_MULTIPLE_SOURCE_ROW_MATCHING`	Deduplicate source before MERGE
`Invalid access token (403)`	`databricks auth login --host <url> --profile <name>`
`ResourceAlreadyExists`	Delete + recreate (monitors, alerts)
`python_task not recognized`	Use `notebook_task` with `notebook_path`
`PARSE_SYNTAX_ERROR`	Read failing SQL file, fix syntax, redeploy
`Parameter not found`	Use `base_parameters` dict, not CLI-style `parameters`
`run_job_task` vs `job_task`	Use `run_job_task` (not `job_task`)
Genie `INTERNAL_ERROR`	Deploy semantic layer (TVFs + Metric Views) first

7. Job Design for Monitorability

Full patterns: references/job-design-patterns.md — structured output, progress messages, fail-fast, partial success, 3-layer hierarchy.

Key rules:

•Every notebook must dbutils.notebook.exit(json.dumps({...})) with structured status
•Print [Step N/M] progress messages so the agent can locate failures
•Fail fast: include actionable error messages (e.g., "Run bronze_setup_job first")
•3-layer hierarchy: atomic (notebook_task) → composite (run_job_task) → orchestrator (run_job_task)
•Partial success: ≥90% items succeeding = OK; fix individual failures

8. Agent Behavioral Patterns

Background Command Pattern (Cursor-Specific)

For databricks bundle run or any long-running command:

•Execute with block_until_ms: 0 to background immediately
•Read the terminal output file to check for completion
•Parse RUN_ID from the output URL

Exponential Backoff for Polling

•Normal jobs: 30s → 60s → 120s (no state change = increase interval)
•Long-running training jobs: start at 60s, max 300s
•DLT pipelines: 30s (typically faster)

Context Capture on Every Failure

On each failure, record and maintain:

•run_id and task_run_id
•job_name and task_key
•Target environment (dev, staging, prod)
•Full error message
•run_page_url for UI access
•Iteration count (which fix attempt this is)

Self-Healing Loop

code

Iteration 1: Deploy → Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 2: Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 3: Run → Monitor → [FAIL] → ESCALATE TO USER

Dependency Awareness

Before deploying, verify prerequisites exist:

•Genie Spaces require: TVFs + Metric Views (semantic layer)
•Monitoring requires: Gold tables (populated)
•Gold merge requires: Gold tables (setup job)
•Silver DLT requires: Bronze tables
•Alerts require: Monitoring output tables
•Always follow: Bronze → Gold → Semantic → Monitoring → Alerts → Genie

Partial Success Handling

Some jobs use ≥90% success thresholds:

•Alerting deployment: 27/30 alerts succeeding = OK, fix the 3 individually
•Monitoring setup: 7/8 monitors = OK, debug the 1 failure
•Do NOT fail the entire workflow for 1 out of N items failing

CLI Limitations Awareness

Some diagnostics require the Databricks Workspace UI:

•Full notebook execution logs (when get-run-output returns truncated or empty)
•DLT pipeline data flow visualization
•Detailed Spark execution plans

Always provide run_page_url to the user when CLI output is insufficient.

Learning Capture Pattern (Self-Improvement Integration)

After every troubleshooting session (successful or escalated), run this checklist:

code

Post-Troubleshooting Self-Improvement:
1. Was this error already in references/error-solution-matrix.md?
   ├── YES → Was the documented fix accurate? If not, update it.
   └── NO  → Add the error pattern and fix to the matrix.

2. Did the initial diagnosis match the actual root cause?
   ├── YES → No action needed.
   └── NO  → Document the misleading signals in the relevant skill.

3. How many iterations were needed?
   ├── 1 → Capture only if error was novel or non-obvious.
   ├── 2-3 → ALWAYS update the relevant skill with the fix.
   └── Escalated → Document what the user found; update skills with user's fix.

4. Is this error likely to recur?
   ├── YES → Ensure it's in Top 10 table (if frequent) or error-solution matrix.
   └── NO  → Document as a one-off note in the session, skip skill update.

Skill priority for updates (search in this order):

•references/error-solution-matrix.md — error-to-fix mapping
•common/databricks-autonomous-operations — troubleshooting decision tree
•common/databricks-asset-bundles — if deployment-related
•Domain-specific skill (e.g., gold/gold-layer-merge-patterns for MERGE errors)
•New skill — only if all above are irrelevant (see admin/self-improvement justification checklist)

Safety: Never Retry Destructive Operations

Without explicit user confirmation, NEVER retry:

•databricks bundle destroy
•DROP TABLE / DROP SCHEMA
•w.quality_monitors.delete()
•w.alerts_v2.delete_alert()

References

Reference Documents (Load on Demand)

•references/sdk-api-reference.md — Full SDK API with code examples for all services
•references/error-solution-matrix.md — Error tables for all 7 resource categories (40+ patterns)
•references/diagnostic-queries.md — SQL queries for investigation
•references/dlt-pipeline-troubleshooting.md — DLT pipeline failure patterns
•references/cli-jq-patterns.md — jq patterns for parsing CLI JSON output
•references/cli-quick-reference.md — CLI vs SDK side-by-side table
•references/job-design-patterns.md — Structured output, progress messages, 3-layer hierarchy

Runnable Examples

•examples/1-authentication.py through examples/6-autonomous-operations.py (see Section 1)

Scripts

•scripts/monitor_multitask_job.sh — Multi-task job monitoring shell script

External Docs

•SDK: https://databricks-sdk-py.readthedocs.io/en/latest/
•GitHub: https://github.com/databricks/databricks-sdk-py