AgentSkillsCN

databricks-autonomous-operations

端到端的自主部署与运维技能。部署Databricks资产包,运行作业与管道,轮询任务完成状态,诊断故障,应用修复措施,重新部署并验证——全程无需人工干预。同时,也作为SDK/CLI/REST API的参考指南。在部署资产包、运行作业、监控管道、排查任意故障,或在自我修复—修复—重新部署的循环中运行时使用此功能。可通过“部署”“资产包部署”“运行作业”“运行管道”“让它运转起来”“作业失败”“排查故障”“修复并重新部署”等触发器来启动该操作。

SKILL.md
--- frontmatter
name: databricks-autonomous-operations
description: >
  End-to-end autonomous deployment and operations skill. Deploys Databricks Asset Bundles,
  runs jobs and pipelines, polls for completion, diagnoses failures, applies fixes, redeploys,
  and verifies — all without human intervention. Also serves as SDK/CLI/REST API reference.
  Use when deploying bundles, running jobs, monitoring pipelines, troubleshooting ANY failure,
  or operating in a self-healing deploy-fix-redeploy cycle. Triggers on "deploy", "bundle deploy",
  "run job", "run pipeline", "make it work", "job failed", "troubleshoot", "fix and redeploy".
metadata:
  author: prashanth subrahmanyam
  version: "3.0"
  domain: operations
  role: shared
  used_by_stages: [1, 2, 3, 4, 5, 6, 7, 8, 9]
  called_by:
    - bronze/00-bronze-layer-setup
    - silver/00-silver-layer-setup
    - gold/01-gold-layer-setup
    - semantic-layer/00-semantic-layer-setup
    - monitoring/00-observability-setup
    - ml/00-ml-pipeline-setup
    - genai-agents/00-genai-agents-setup
  dependencies:
    - admin/self-improvement
  triggers:
    - "deploy"
    - "bundle deploy"
    - "bundle run"
    - "run job"
    - "run pipeline"
    - "make it work"
    - "deploy and run"
    - "deploy and fix"
    - "job failed"
    - "monitor job"
    - "troubleshoot"
    - "get-run"
    - "run output"
    - "task failed"
    - "redeploy"
    - "job status"
    - "pipeline failed"
    - "DLT error"
    - "cluster issue"
    - "monitor failed"
    - "alert failed"
    - "deploy failed"
    - "databricks-sdk"
    - "databricks-connect"
    - "CLI"
    - "INTERNAL_ERROR"
    - "UNRESOLVED_COLUMN"
    - "ModuleNotFoundError"
    - "TABLE_OR_VIEW_NOT_FOUND"
    - "ResourceAlreadyExists"
    - "PERMISSION_DENIED"
    - "self-heal"
    - "autonomous"
  last_verified: "2026-02-07"
  volatility: medium
  upstream_sources:
    - name: "ai-dev-kit"
      repo: "databricks-solutions/ai-dev-kit"
      paths:
        - "databricks-skills/databricks-python-sdk/SKILL.md"
        - "databricks-skills/databricks-jobs/SKILL.md"
      relationship: "extended"
      last_synced: "2026-02-09"
      sync_commit: "97a3637"

Databricks Autonomous Operations

1. Overview

This skill is both an SDK/CLI/Connect reference and an autonomous operations playbook. It teaches the AI agent to operate as an SRE — independently deploying, monitoring, diagnosing failures, applying fixes, redeploying, and verifying results across all Databricks resource types.

Core Loop: Deploy → Poll → Diagnose → Fix → Redeploy → Verify (max 3 iterations before escalation to user).

When to Activate This Skill

  • Deploying or running Databricks Asset Bundles (bundle deploy, bundle run)
  • Monitoring job or pipeline runs for completion
  • Troubleshooting ANY failure: jobs, DLT pipelines, monitors, alerts, clusters, Genie Spaces
  • Using the Databricks Python SDK, CLI, Connect, or REST API
  • Encountering error messages from Databricks services
  • Operating in a self-healing deploy-fix-redeploy cycle
  • Checking job/task/pipeline status or retrieving run output

SDK Docs: https://databricks-sdk-py.readthedocs.io/en/latest/ GitHub: https://github.com/databricks/databricks-sdk-py

Runnable Examples: See examples/ directory for complete, copy-pasteable scripts:

  • examples/1-authentication.py — All auth patterns (env vars, profiles, Azure SP, AccountClient, notebook context)
  • examples/2-clusters-and-jobs.py — Cluster auto-selection, autoscaling, job CRUD, submit_and_wait, run_now_and_wait
  • examples/3-sql-and-warehouses.py — Parameterized queries, chunked results, query_to_dataframe helper
  • examples/4-unity-catalog.py — Tables, schemas, catalogs, volumes, file operations, pattern matching
  • examples/5-serving-and-vector-search.py — Endpoint creation with TrafficConfig, chat/embedding queries, vector search with filters
  • examples/6-autonomous-operations.py — Job monitoring, multi-task output retrieval, failure diagnosis, self-healing loop

2. Environment & Authentication

Setup

  • SDK: uv pip install databricks-sdk; Connect: uv pip install databricks-connect
  • CLI version: >= 0.278.0 (databricks --version)
  • Config: ~/.databrickscfg or env vars DATABRICKS_HOST, DATABRICKS_TOKEN

Quick Auth

python
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()                    # Auto-detect (env, config file, or notebook)
w = WorkspaceClient(profile="MY_PROFILE") # Named profile
  • Token expired? databricks auth login --host <url> --profile <name>
  • Profile-based CLI: DATABRICKS_CONFIG_PROFILE=<name> databricks <command>
  • Full auth patterns (Azure SP, AccountClient, etc.): see references/sdk-api-reference.md

3. SDK API Quick Reference

Full code examples and patterns: references/sdk-api-reference.md Runnable scripts: examples/ directory (see file listing in Section 1)

APIKey OperationsCritical Notes
Clusterslist, get, create_and_wait, ensure_cluster_is_running, start, stopUse select_spark_version() and select_node_type()
Jobslist, run_now_and_wait, submit_and_wait, get_run, get_run_output, cancel_runget_run_output needs task run_id, not parent job run_id
SQL Executionexecute_statement with wait_timeoutUse StatementParameterListItem for parameterized queries
Warehouseslist, start, stop, create_and_waitUse auto_stop_mins to control costs
Unity Catalogtables.list, tables.get, tables.exists, catalogs.list, schemas.listlist_summaries for fast pattern matching
Volumes/Filesfiles.upload, files.download, files.list_directory_contentsPath format: /Volumes/catalog/schema/volume/file
Servingcreate_and_wait, query (custom/chat/embeddings), get_open_ai_clientscale_to_zero_enabled for cost control
Vector Searchquery_index, upsert_data_vector_index, sync_indexUse filters_json for filtered queries
Pipelineslist_pipelines, get, start_update, stop_and_wait, list_pipeline_eventsEvents contain error details for troubleshooting
Monitorsquality_monitors.get, delete, update, run_refreshUpdate MUST include ALL custom_metrics
Alerts V2alerts_v2.create_alert, update_alert, delete_alertUpdate requires full update_mask
Secretscreate_scope, put_secret, get_secretUse for API keys, tokens

Critical Patterns

  • Async apps (FastAPI): SDK is synchronous — wrap with asyncio.to_thread()
  • Long-running ops: Use _and_wait() variants with timeout=timedelta(...)
  • Error handling: from databricks.sdk.errors import NotFound, PermissionDenied, ResourceAlreadyExists
  • Direct REST: w.api_client.do(method="GET", path="/api/2.0/...")

4. CLI Operations Reference

Command Matrix

CategoryCommandPurpose
Bundledatabricks bundle validatePre-deploy validation (catches ~80% of errors)
Bundledatabricks bundle deploy -t <target>Deploy all resources
Bundledatabricks bundle run -t <target> <job>Start a job run
Bundledatabricks bundle destroy -t <target>Cleanup all resources
Jobsdatabricks jobs get-run <RUN_ID> --output jsonGet run status
Jobsdatabricks jobs get-run-output <TASK_RUN_ID> --output jsonGet task notebook output
Jobsdatabricks jobs list-runs --job-id <JOB_ID> --output jsonList recent runs
Jobsdatabricks jobs cancel-run <RUN_ID>Cancel running job
Pipelinesdatabricks pipelines get <PIPELINE_ID> --output jsonGet pipeline status
Pipelinesdatabricks pipelines get-update <PID> <UID> --output jsonGet update details
Pipelinesdatabricks pipelines list-pipeline-events <PID> --output jsonGet events/errors
Pipelinesdatabricks pipelines start-update <PIPELINE_ID>Start pipeline update
Clustersdatabricks clusters get <CLUSTER_ID> --output jsonGet cluster status
Clustersdatabricks clusters events <CLUSTER_ID> --output jsonGet cluster events
Warehousesdatabricks warehouses get <WH_ID> --output jsonGet warehouse status
Authdatabricks auth login --host <url> --profile <name>Re-authenticate
Workspacedatabricks workspace export <path> --format SOURCEExport notebook
Appsdatabricks apps logs <app-name>App deployment/runtime logs

Key jq Patterns

See references/cli-jq-patterns.md for the complete catalog. Most critical:

bash
# Job state
databricks jobs get-run <RUN_ID> --output json | jq '.state'

# Task summary for multi-task jobs
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | {task: .task_key, run_id: .run_id, result: .state.result_state}'

# Failed tasks only
databricks jobs get-run <RUN_ID> --output json | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message, url: .run_page_url}'

# Task output (MUST use task run_id, not parent job run_id)
databricks jobs get-run-output <TASK_RUN_ID> --output json | jq -r '.notebook_output.result // "No output"'

5. Autonomous Deploy-Run-Fix Playbook

This is the core workflow. Follow these steps whenever deploying a bundle, running a job, or operating autonomously.

Step 0: Discover Bundle Resources

Before deploying, understand what's in the bundle:

bash
# Read databricks.yml to find targets and included resources
# Then read resource YAML files under resources/ directory
# Identify: jobs (atomic/composite/orchestrator), pipelines, dashboards, alerts

Determine the target (dev, staging, prod) from the user or default to dev.

Step 1: Validate & Deploy

bash
databricks bundle validate -t <target>   # Pre-flight — catches ~80% of errors
databricks bundle deploy -t <target>     # Deploy all resources

If validate fails: Read the error, fix the YAML (see Section 6 + common/databricks-asset-bundles skill), re-validate. If deploy fails: Common causes: auth expired (403), path resolution errors, invalid task types. See Section 6.

Step 2: Run

For jobs:

bash
databricks bundle run -t <target> <job_name>
# Output includes a Run URL like: https://<host>/#job/<JOB_ID>/run/<RUN_ID>
# EXTRACT the RUN_ID from this URL — you need it for polling

For DLT pipelines:

bash
databricks bundle run -t <target> <pipeline_name>
# Output includes: Pipeline URL with PIPELINE_ID
# Also shows UPDATE_ID for this specific update

Cursor-specific: Use block_until_ms: 0 to background bundle run immediately. Then read the terminal output file to capture the Run URL. Parse the RUN_ID or PIPELINE_ID from the URL.

Step 3: Poll Until Terminal State

Jobs — poll with exponential backoff (30s → 60s → 120s):

bash
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.life_cycle_state'
# PENDING → RUNNING → TERMINATED (also: INTERNAL_ERROR, SKIPPED)

When life_cycle_state == TERMINATED, check the result:

bash
databricks jobs get-run <RUN_ID> --output json | jq -r '.state.result_state'
# SUCCESS → go to Step 4a    FAILED → go to Step 4b

DLT Pipelines — poll the latest update state:

bash
databricks pipelines get <PIPELINE_ID> --output json | jq -r '.latest_updates[0].state'
# COMPLETED → success    FAILED → get events for errors

Step 4a: On SUCCESS — Verify & Report

bash
# For multi-task jobs, verify all tasks succeeded:
databricks jobs get-run <RUN_ID> --output json \
  | jq '.tasks[] | {task: .task_key, result: .state.result_state}'

# Get task output (structured JSON from dbutils.notebook.exit):
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // "No output"'

Report success to user with the Run URL for verification. If this was part of a multi-job deployment, proceed to the next job.

Step 4b: On FAILURE — Diagnose

CRITICAL for multi-task jobs: get-run-output only works on the task run_id, NOT the parent job run_id.

bash
# Step A: Find failed tasks and their task-level run_ids
databricks jobs get-run <JOB_RUN_ID> --output json \
  | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, run_id: .run_id, error: .state.state_message}'

# Step B: Get detailed output for EACH failed task
databricks jobs get-run-output <TASK_RUN_ID> --output json \
  | jq -r '.notebook_output.result // .error // "No output"'

For DLT pipeline failures:

bash
databricks pipelines list-pipeline-events <PID> --output json \
  | jq '[.events[] | select(.level == "ERROR")] | .[0:5]'

Match the error against Section 6 (decision tree) and references/error-solution-matrix.md.

Step 5: Fix

  1. Read the source file(s) identified from the stack trace or error message
  2. Apply fix using editor tools (StrReplace, Write)
  3. If YAML/config issue → fix the DAB YAML (consult common/databricks-asset-bundles skill)
  4. If dependency issue → deploy upstream assets first (see Section 6 ordering)

Step 6: Redeploy & Re-run (Loop)

bash
databricks bundle deploy -t <target>    # Redeploy with fix
databricks bundle run -t <target> <job>  # Re-run
# → Return to Step 3 (Poll)

Maximum 3 iterations. Track each iteration's error and fix.

Step 7: Escalation (After 3 Failed Attempts)

Present to the user:

  1. All errors encountered — with run IDs, task keys, error messages
  2. All fixes attempted — what was changed and why
  3. Root cause hypothesis — best guess based on evidence
  4. Run page URLs — direct links to the Databricks UI

Step 8: Capture Learnings

After every resolved failure (or escalation), trigger admin/self-improvement:

  • Add new error → fix mappings to references/error-solution-matrix.md
  • Update "Top 10" table in Section 6 if this error is frequent
  • Always capture if fix took 2+ iterations or was escalated

6. Master Troubleshooting Decision Tree

Full error-solution tables: references/error-solution-matrix.md (7 categories, 40+ patterns) DLT-specific troubleshooting: references/dlt-pipeline-troubleshooting.md Diagnostic SQL queries: references/diagnostic-queries.md

Quick Decision Guide

  1. Deployment error? → Check auth (403), YAML syntax, task type (notebook_task not python_task), path resolution
  2. Job runtime error? → Get task-level output (use task run_id), match error to solution matrix
  3. DLT pipeline error? → Get events via CLI, check upstream tables exist, verify schema
  4. Monitor/Alert error? → Check ResourceAlreadyExists (delete+recreate), schema existence, update_mask
  5. Dependency error? → Follow required order: Bronze → Gold Setup → Gold Merge → Semantic → Monitoring → Genie
  6. Cluster error? → Check events via CLI, look for OOM/unreachable/timeout patterns

Top 10 Most Common Errors (Inline)

ErrorQuick Fix
ModuleNotFoundErrorAdd to %pip install or DAB environment spec
TABLE_OR_VIEW_NOT_FOUNDRun setup job first; check 3-part catalog.schema.table path
DELTA_MULTIPLE_SOURCE_ROW_MATCHINGDeduplicate source before MERGE
Invalid access token (403)databricks auth login --host <url> --profile <name>
ResourceAlreadyExistsDelete + recreate (monitors, alerts)
python_task not recognizedUse notebook_task with notebook_path
PARSE_SYNTAX_ERRORRead failing SQL file, fix syntax, redeploy
Parameter not foundUse base_parameters dict, not CLI-style parameters
run_job_task vs job_taskUse run_job_task (not job_task)
Genie INTERNAL_ERRORDeploy semantic layer (TVFs + Metric Views) first

7. Job Design for Monitorability

Full patterns: references/job-design-patterns.md — structured output, progress messages, fail-fast, partial success, 3-layer hierarchy.

Key rules:

  • Every notebook must dbutils.notebook.exit(json.dumps({...})) with structured status
  • Print [Step N/M] progress messages so the agent can locate failures
  • Fail fast: include actionable error messages (e.g., "Run bronze_setup_job first")
  • 3-layer hierarchy: atomic (notebook_task) → composite (run_job_task) → orchestrator (run_job_task)
  • Partial success: ≥90% items succeeding = OK; fix individual failures

8. Agent Behavioral Patterns

Background Command Pattern (Cursor-Specific)

For databricks bundle run or any long-running command:

  1. Execute with block_until_ms: 0 to background immediately
  2. Read the terminal output file to check for completion
  3. Parse RUN_ID from the output URL

Exponential Backoff for Polling

  • Normal jobs: 30s → 60s → 120s (no state change = increase interval)
  • Long-running training jobs: start at 60s, max 300s
  • DLT pipelines: 30s (typically faster)

Context Capture on Every Failure

On each failure, record and maintain:

  • run_id and task_run_id
  • job_name and task_key
  • Target environment (dev, staging, prod)
  • Full error message
  • run_page_url for UI access
  • Iteration count (which fix attempt this is)

Self-Healing Loop

code
Iteration 1: Deploy → Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 2: Run → Monitor → [FAIL] → Diagnose → Fix → Redeploy
Iteration 3: Run → Monitor → [FAIL] → ESCALATE TO USER

Dependency Awareness

Before deploying, verify prerequisites exist:

  • Genie Spaces require: TVFs + Metric Views (semantic layer)
  • Monitoring requires: Gold tables (populated)
  • Gold merge requires: Gold tables (setup job)
  • Silver DLT requires: Bronze tables
  • Alerts require: Monitoring output tables
  • Always follow: Bronze → Gold → Semantic → Monitoring → Alerts → Genie

Partial Success Handling

Some jobs use ≥90% success thresholds:

  • Alerting deployment: 27/30 alerts succeeding = OK, fix the 3 individually
  • Monitoring setup: 7/8 monitors = OK, debug the 1 failure
  • Do NOT fail the entire workflow for 1 out of N items failing

CLI Limitations Awareness

Some diagnostics require the Databricks Workspace UI:

  • Full notebook execution logs (when get-run-output returns truncated or empty)
  • DLT pipeline data flow visualization
  • Detailed Spark execution plans

Always provide run_page_url to the user when CLI output is insufficient.

Learning Capture Pattern (Self-Improvement Integration)

After every troubleshooting session (successful or escalated), run this checklist:

code
Post-Troubleshooting Self-Improvement:
1. Was this error already in references/error-solution-matrix.md?
   ├── YES → Was the documented fix accurate? If not, update it.
   └── NO  → Add the error pattern and fix to the matrix.

2. Did the initial diagnosis match the actual root cause?
   ├── YES → No action needed.
   └── NO  → Document the misleading signals in the relevant skill.

3. How many iterations were needed?
   ├── 1 → Capture only if error was novel or non-obvious.
   ├── 2-3 → ALWAYS update the relevant skill with the fix.
   └── Escalated → Document what the user found; update skills with user's fix.

4. Is this error likely to recur?
   ├── YES → Ensure it's in Top 10 table (if frequent) or error-solution matrix.
   └── NO  → Document as a one-off note in the session, skip skill update.

Skill priority for updates (search in this order):

  1. references/error-solution-matrix.md — error-to-fix mapping
  2. common/databricks-autonomous-operations — troubleshooting decision tree
  3. common/databricks-asset-bundles — if deployment-related
  4. Domain-specific skill (e.g., gold/gold-layer-merge-patterns for MERGE errors)
  5. New skill — only if all above are irrelevant (see admin/self-improvement justification checklist)

Safety: Never Retry Destructive Operations

Without explicit user confirmation, NEVER retry:

  • databricks bundle destroy
  • DROP TABLE / DROP SCHEMA
  • w.quality_monitors.delete()
  • w.alerts_v2.delete_alert()

References

Reference Documents (Load on Demand)

  • references/sdk-api-reference.md — Full SDK API with code examples for all services
  • references/error-solution-matrix.md — Error tables for all 7 resource categories (40+ patterns)
  • references/diagnostic-queries.md — SQL queries for investigation
  • references/dlt-pipeline-troubleshooting.md — DLT pipeline failure patterns
  • references/cli-jq-patterns.md — jq patterns for parsing CLI JSON output
  • references/cli-quick-reference.md — CLI vs SDK side-by-side table
  • references/job-design-patterns.md — Structured output, progress messages, 3-layer hierarchy

Runnable Examples

  • examples/1-authentication.py through examples/6-autonomous-operations.py (see Section 1)

Scripts

  • scripts/monitor_multitask_job.sh — Multi-task job monitoring shell script

External Docs