observability-setup

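A comprehensive guide to Databricks Lakehouse Monitoring (data profiling), including a quick-start workflow (2 hours), fill-in-the-blank requirements templates, concrete fact/dimension monitoring examples, and complete deployment patterns. Uses the new data quality API (`databricks.sdk.service.dataquality`). Use this skill when setting up Lakehouse Monitoring for Gold-layer tables, creating custom business metrics, designing a monitoring strategy, querying monitoring tables for dashboards, or troubleshooting monitor initialization failures. Covers setup patterns with graceful degradation, custom metric syntax (AGGREGATE, DERIVED, DRIFT), table-level business KPIs built around `input_columns=[":table"]`, query patterns for dashboards, handling asynchronous operations, monitor cleanup, Genie documentation integration, and production deployment workflows.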

SKILL.md
--- frontmatter
name: observability-setup
description: >
  End-to-end orchestrator for setting up Databricks observability including Lakehouse Monitoring,
  AI/BI Dashboards, and SQL Alerts. Guides users through monitor creation for Gold tables,
  dashboard design with monitoring widgets, and config-driven alerting. Orchestrates mandatory
  dependencies on monitoring skills (lakehouse-monitoring-comprehensive, databricks-aibi-dashboards,
  sql-alerting-patterns) and common skills (databricks-asset-bundles, databricks-expert-agent,
  databricks-python-imports).
  Use when setting up observability end-to-end, creating Lakehouse Monitors, building dashboards,
  or configuring SQL alerts.
license: Apache-2.0
metadata:
  author: prashanth subrahmanyam
  version: "1.0.0"
  domain: monitoring
  role: orchestrator
  pipeline_stage: 7
  pipeline_stage_name: observability
  next_stages:
    - ml-pipeline-setup
  workers:
    - lakehouse-monitoring-comprehensive
    - databricks-aibi-dashboards
    - sql-alerting-patterns
  common_dependencies:
    - databricks-asset-bundles
    - databricks-expert-agent
    - databricks-python-imports
    - naming-tagging-standards
    - databricks-autonomous-operations
  consumes:
    - plans/manifests/observability-manifest.yaml
  consumes_fallback: "Gold table inventory (self-discovery from catalog)"
  dependencies:
    - lakehouse-monitoring-comprehensive
    - databricks-aibi-dashboards
    - sql-alerting-patterns
    - databricks-asset-bundles
    - databricks-expert-agent
    - databricks-python-imports
  last_verified: "2026-02-07"
  volatility: medium
  upstream_sources: []  # Internal orchestrator

Observability Setup Orchestrator

End-to-end workflow for setting up Databricks observability — Lakehouse Monitoring, AI/BI Dashboards, and SQL Alerts — on top of a completed Gold layer and semantic layer.

Predecessor: semantic-layer-setup skill (Semantic layer should be complete, but Gold tables are the minimum requirement)

Time Estimate: 3-5 hours for initial setup, 30 min per additional table/dashboard

What You'll Create:

  1. Lakehouse Monitors — data quality, drift, and custom business KPIs for Gold tables
  2. AI/BI Dashboards — Lakeview dashboards with monitoring widgets and business metrics
  3. SQL Alerts — config-driven alerting with severity-based routing

Decision Tree

| Question | Action |
| --- | --- |
| Setting up observability end-to-end? | Use this skill; it orchestrates everything |
| Only need Lakehouse Monitoring? | Read monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md directly |
| Only need AI/BI Dashboards? | Read monitoring/02-databricks-aibi-dashboards/SKILL.md directly |
| Only need SQL Alerts? | Read monitoring/03-sql-alerting-patterns/SKILL.md directly |

Mandatory Skill Dependencies

CRITICAL: Before generating ANY code for observability, you MUST read and follow the patterns in these common skills. Do NOT generate these patterns from memory.

| Phase | MUST Read Skill (use Read tool on SKILL.md) | What It Provides |
| --- | --- | --- |
| All phases | common/databricks-expert-agent | Core extraction principle: extract names from source, never hardcode |
| Monitor scripts | common/databricks-python-imports | Pure Python module patterns for helpers |
| Job deployment | common/databricks-asset-bundles | Job YAML, deployment patterns |
| Troubleshooting | common/databricks-autonomous-operations | Deploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail |

Monitoring-Domain Dependencies

| Skill | Requirement | What It Provides |
| --- | --- | --- |
| monitoring/01-lakehouse-monitoring-comprehensive | MUST read at Phase 1 | Monitor setup, custom metrics, graceful degradation |
| monitoring/02-databricks-aibi-dashboards | MUST read at Phase 2 | Dashboard JSON, widget patterns, deployment |
| monitoring/03-sql-alerting-patterns | MUST read at Phase 3 | Config-driven alerts, SDK deployment, severity routing |

🔴 Non-Negotiable Defaults

| Default | Value | Applied Where | NEVER Do This Instead |
| --- | --- | --- | --- |
| Monitor type | MonitorTimeSeries or MonitorSnapshot | Every Lakehouse Monitor | ❌ NEVER skip monitor type selection |
| Custom metrics | input_columns=[":table"] for table-level KPIs | Every custom business metric | ❌ NEVER use column-level when table-level is needed |
| Dashboard deployment | Lakeview JSON with dataset_catalog/dataset_schema | Every dashboard | ❌ NEVER hardcode catalog/schema in dashboard queries |
| Alert queries | Fully qualified table names (no parameters) | Every SQL alert query | ❌ NEVER use parameterized table names in alerts |
| Serverless | environments: block with environment_key | Every monitoring job | ❌ NEVER define job_clusters: |
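
The serverless default in the last row can be checked mechanically before deployment. The sketch below is one possible pre-deploy check, not part of the referenced skills: `check_monitoring_job` is a hypothetical helper, and it assumes the job definition has already been parsed from the Asset Bundle YAML into a plain dict with top-level `environments` and `job_clusters` keys.

```python
def check_monitoring_job(job: dict) -> list[str]:
    """Return violations of the serverless default for one parsed job definition.

    Assumption: `job` is the dict form of a single job resource, with
    optional top-level keys `environments` and `job_clusters`.
    """
    problems = []

    # The defaults table forbids job_clusters on monitoring jobs.
    if "job_clusters" in job:
        problems.append("job_clusters is defined; monitoring jobs should be serverless")

    # A serverless job needs an environments block whose entries carry a key.
    environments = job.get("environments") or []
    if not environments:
        problems.append("missing environments block")
    elif any(not env.get("environment_key") for env in environments):
        problems.append("an environments entry has no environment_key")

    return problems
```

An empty list means the job passes; anything else can be printed or turned into a failed validation step.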

Phased Implementation Workflow

Phase 0: Read Plan (5 minutes)

Before starting implementation, check for a planning manifest that defines what to build.

```python
import yaml
from pathlib import Path

manifest_path = Path("plans/manifests/observability-manifest.yaml")

if manifest_path.exists():
    with open(manifest_path) as f:
        # An empty YAML file loads as None, so fall back to an empty dict
        manifest = yaml.safe_load(f) or {}

    # Extract implementation checklist from manifest
    monitors = manifest.get("lakehouse_monitors", [])
    dashboards = manifest.get("dashboards", [])
    alerts = manifest.get("alerts", [])
    print(f"Plan: {len(monitors)} monitors, {len(dashboards)} dashboards, {len(alerts)} alerts")

    # Each monitor has: table_name, monitor_type, custom_metrics, slicing_exprs
    # Each dashboard has: name, pages, widgets
    # Each alert has: alert_id, severity, query, threshold, schedule
else:
    # Fallback: self-discovery from Gold tables
    print("No manifest found; falling back to Gold table self-discovery")
    # Discover Gold tables from catalog, create one monitor per table
```

If the manifest exists: Use it as the implementation checklist. Every monitor, dashboard, and alert is pre-defined with configuration details. Track completion against the manifest's summary counts.

If the manifest doesn't exist: Fall back to self-discovery: inventory Gold tables, create one monitor per table (TimeSeries for facts, Snapshot for dimensions), and generate standard dashboards and alerts. This works but may miss custom business KPIs that the planning phase would have defined.
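
The self-discovery fallback can be sketched as a small planning function. This is an illustration under stated assumptions, not the skill's own implementation: `plan_monitors` is a hypothetical helper, the `fact_` / `dim_` prefixes are an assumed naming convention for Gold tables, and the table names in the usage note are made up. The output mirrors the per-monitor fields listed for the manifest (`table_name`, `monitor_type`, `custom_metrics`, `slicing_exprs`).

```python
def plan_monitors(gold_tables: list[str]) -> list[dict]:
    """Build one monitor entry per Gold table, with fact tables listed first.

    Assumption: fact tables are named fact_* and dimension tables dim_*.
    Tables that match neither prefix get monitor_type=None so that a human
    picks the type instead of the script guessing.
    """
    plan = []
    for full_name in gold_tables:
        short_name = full_name.split(".")[-1]
        if short_name.startswith("fact_"):
            monitor_type = "TimeSeries"
        elif short_name.startswith("dim_"):
            monitor_type = "Snapshot"
        else:
            monitor_type = None
        plan.append({
            "table_name": full_name,
            "monitor_type": monitor_type,
            "custom_metrics": [],   # no business KPIs are known without a manifest
            "slicing_exprs": [],
        })
    # Stable sort: TimeSeries (fact) entries move to the front, order otherwise kept.
    plan.sort(key=lambda entry: entry["monitor_type"] != "TimeSeries")
    return plan
```

For example, `plan_monitors(["main.gold.dim_customer", "main.gold.fact_sales"])` returns the `fact_sales` entry first. The empty `custom_metrics` lists make the limitation of the fallback visible: business KPIs still have to be added by hand.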


Phase 1: Lakehouse Monitoring (1-2 hours)

MANDATORY: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
| --- | --- | --- |
| 1 | data_product_accelerator/skills/common/databricks-expert-agent/SKILL.md | Extract-don't-generate principle |
| 2 | data_product_accelerator/skills/monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md | Monitor setup, custom metrics |

Steps:

  1. Inventory Gold tables that need monitoring (fact tables are highest priority)
  2. Choose monitor type per table (TimeSeries for facts, Snapshot for dimensions)
  3. Define custom business metrics using input_columns=[":table"] for table-level KPIs
  4. Create monitor setup script with graceful degradation (delete-then-create pattern)
  5. Deploy monitors and verify metric tables are populated
  6. Document monitor configuration in Genie Space instructions (if applicable)
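
Steps 3 and 4 can be sketched without committing to a specific SDK surface. The block below is a minimal illustration, assuming that "graceful degradation" means a failure on one table is logged and does not abort the run. `CustomMetric`, `setup_monitor`, `create_fn`, and `delete_fn` are hypothetical names; in a real script the two callables would wrap whatever monitor create and delete calls the lakehouse-monitoring-comprehensive skill prescribes. The metric definition shown in the usage note is an illustrative SQL expression over a made-up column.

```python
from dataclasses import dataclass, field
from typing import Callable

METRIC_TYPES = {"AGGREGATE", "DERIVED", "DRIFT"}


@dataclass
class CustomMetric:
    """Plain description of one custom metric (not an SDK type)."""
    name: str
    metric_type: str
    definition: str
    # Table-level KPIs use the special ":table" input column by default.
    input_columns: list[str] = field(default_factory=lambda: [":table"])

    def __post_init__(self) -> None:
        if self.metric_type not in METRIC_TYPES:
            raise ValueError(f"unknown metric type: {self.metric_type}")


def setup_monitor(
    table_name: str,
    metrics: list[CustomMetric],
    create_fn: Callable[[str, list[CustomMetric]], None],
    delete_fn: Callable[[str], None],
) -> bool:
    """Delete any existing monitor, then create a fresh one. Never raises."""
    try:
        delete_fn(table_name)
    except Exception as exc:  # typically: no monitor exists yet
        print(f"[{table_name}] delete skipped: {exc}")
    try:
        create_fn(table_name, metrics)
    except Exception as exc:
        print(f"[{table_name}] monitor creation failed, continuing: {exc}")
        return False
    return True
```

A caller would build metrics such as `CustomMetric("total_revenue", "AGGREGATE", "SUM(net_revenue)")`, loop over the table inventory, and collect the tables for which `setup_monitor` returned `False` for follow-up.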

Phase 2: AI/BI Dashboards (1-2 hours)

MANDATORY: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
| --- | --- | --- |
| 1 | data_product_accelerator/skills/monitoring/02-databricks-aibi-dashboards/SKILL.md | Dashboard JSON, widget patterns |

Steps:

  1. Design dashboard layout: monitoring overview + business metrics sections
  2. Create queries using monitoring profile/drift tables
  3. Build widget configurations with proper number formatting
  4. Set dataset_catalog and dataset_schema for environment portability
  5. Deploy dashboard via Asset Bundle or API
  6. Validate all widgets render correctly
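
For step 2, a dashboard dataset query over the monitor's profile table might be generated as follows. This is a sketch with several assumptions that should be checked against the dashboards skill and the actual monitor output: the profile table is assumed to be named `<table>_profile_metrics`, the filter columns (`window`, `column_name`, `log_type`, `slice_key`) are assumed from the profile metrics schema, and the table name is left unqualified on the assumption that `dataset_catalog` and `dataset_schema` supply the catalog and schema. `profile_kpi_query` is a hypothetical helper.

```python
def profile_kpi_query(monitored_table: str, kpi_columns: list[str]) -> str:
    """Build a SQL query that reads table-level KPIs from a profile metrics table.

    Assumption: custom metrics defined with input_columns=[":table"] appear as
    columns on rows where column_name = ':table'.
    """
    if "." in monitored_table:
        # Keep catalog/schema out of the query text so the dashboard stays portable.
        raise ValueError("pass the bare table name, not catalog.schema.table")
    if not kpi_columns:
        raise ValueError("at least one KPI column is required")

    select_list = ",\n  ".join(kpi_columns)
    return (
        "SELECT\n"
        "  window.start AS window_start,\n"
        f"  {select_list}\n"
        f"FROM {monitored_table}_profile_metrics\n"
        "WHERE column_name = ':table'\n"
        "  AND log_type = 'INPUT'\n"
        "  AND slice_key IS NULL\n"
        "ORDER BY window_start"
    )
```

Calling `profile_kpi_query("fact_sales", ["total_revenue"])` yields a query whose output columns (`window_start`, `total_revenue`) are the names the widget encodings then need to reference.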

Phase 3: SQL Alerts (1 hour)

MANDATORY: Read each skill below using the Read tool BEFORE writing any code for this phase:

| # | Skill Path | What It Provides |
| --- | --- | --- |
| 1 | data_product_accelerator/skills/monitoring/03-sql-alerting-patterns/SKILL.md | Config-driven alerts, SDK deployment |
| 2 | data_product_accelerator/skills/common/databricks-asset-bundles/SKILL.md | Job YAML for alert deployment |

Steps:

  1. Create alert configuration table (Delta table-based, severity-driven)
  2. Define alert rules: threshold, percentage change, anomaly detection
  3. Deploy alerts via Databricks SDK (V2 dict-based or typed classes)
  4. Configure notification destinations per severity level
  5. Set up Quartz cron schedules for alert evaluation
  6. Validate alerts fire correctly with test data
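
The alert rules from these steps can be validated before they are written to the configuration table. The sketch below is illustrative only: `AlertRule` and `ROUTING` are hypothetical names, the rule fields follow the per-alert fields listed for the manifest, and the routing follows the critical-to-PagerDuty, warning-to-email convention from the validation checklist. The table-name check is a regex heuristic, not a SQL parser.

```python
import re
from dataclasses import dataclass

# Severity-based routing; the destination identifiers are placeholders.
ROUTING = {"critical": "pagerduty", "warning": "email"}

# Heuristic: grab the identifier that follows FROM or JOIN.
# It can misfire on constructs such as EXTRACT(day FROM ts).
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([`\w]+(?:\.[`\w]+)*)", re.IGNORECASE)
# Placeholders such as ${catalog} or {{ schema }} are not allowed in alert queries.
PLACEHOLDER = re.compile(r"\$\{[^}]*\}|\{\{[^}]*\}\}")


@dataclass
class AlertRule:
    alert_id: str
    severity: str
    query: str
    threshold: float
    schedule: str  # Quartz cron: seconds minutes hours day-of-month month day-of-week [year]

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the rule looks deployable."""
        problems = []
        if self.severity not in ROUTING:
            problems.append(f"no routing for severity {self.severity!r}")
        if PLACEHOLDER.search(self.query):
            problems.append("query contains a parameter placeholder")
        for name in TABLE_REF.findall(self.query):
            if name.count(".") != 2:
                problems.append(f"table {name!r} is not catalog.schema.table")
        if len(self.schedule.split()) not in (6, 7):
            problems.append("schedule is not a Quartz cron expression")
        return problems
```

For instance, a rule with the schedule `"0 0 * * * ?"` (every hour on the hour in Quartz syntax) and a query against `main.gold.fact_sales` passes, while a five-field Unix cron string or a `${catalog}` placeholder is reported.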

Post-Creation Validation

Common Skill Compliance

  • Names extracted from Gold YAML (not generated) per databricks-expert-agent
  • Asset Bundle YAML follows databricks-asset-bundles patterns
  • Python imports follow databricks-python-imports patterns

Observability Specifics

  • Lakehouse Monitors created for all critical Gold tables
  • Custom business metrics use input_columns=[":table"] syntax
  • Monitor setup uses graceful degradation (delete-then-create)
  • Dashboard uses dataset_catalog/dataset_schema for portability
  • Dashboard widgets align with query columns
  • Alert queries use fully qualified table names (no parameters)
  • Alert severity routing configured (critical → PagerDuty, warning → email)
  • All monitoring jobs use serverless compute

Pipeline Progression

Previous stage: semantic-layer-setup → Metric Views, TVFs, and Genie Spaces should exist

Next stage: After completing observability, proceed to:

  • ml/00-ml-pipeline-setup — Set up ML models, experiments, and batch inference

Related Skills

| Skill | Relationship | Path |
| --- | --- | --- |
| lakehouse-monitoring-comprehensive | Mandatory: Monitor setup | monitoring/01-lakehouse-monitoring-comprehensive/SKILL.md |
| databricks-aibi-dashboards | Mandatory: Dashboard patterns | monitoring/02-databricks-aibi-dashboards/SKILL.md |
| sql-alerting-patterns | Mandatory: Alert framework | monitoring/03-sql-alerting-patterns/SKILL.md |
| databricks-expert-agent | Mandatory: Extraction principle | common/databricks-expert-agent/SKILL.md |
| databricks-asset-bundles | Mandatory: Deployment | common/databricks-asset-bundles/SKILL.md |
| databricks-python-imports | Mandatory: Python patterns | common/databricks-python-imports/SKILL.md |

References