SQL Alerting: Config-Driven Framework for Databricks
Overview
This skill provides a comprehensive config-driven framework for setting up Databricks SQL Alerts using the V2 API, a Delta configuration table, and SDK deployment. It covers alert rule design, SQL query patterns, SDK integration (both V2 dict-based and typed classes), hierarchical job architecture, proactive query validation, partial success deployment, and troubleshooting workflows.
When to Use This Skill
- •Setting up SQL alerts for Gold layer tables
- •Creating alert configuration tables
- •Deploying alerts via Databricks SDK (V2 or typed classes)
- •Troubleshooting alert failures
- •Implementing config-driven alerting (runtime updates without code changes)
- •Creating severity-based notification routing
- •Setting up hierarchical job architecture for alerting pipelines
- •Validating alert queries before deployment
Core Principles
Principle 1: Config-Driven Alerting
Alert rules are stored in a Delta configuration table, not hardcoded. This enables:
- •Runtime updates without code changes
- •Centralized alert management
- •Version history via Delta time travel
- •Easy enable/disable without deployment
Principle 2: Hierarchical Job Architecture
⚠️ CRITICAL: Use atomic + composite jobs pattern for production alerting:
Layer 1 (Atomic - single notebook per job): ├── alerting_tables_setup_job.yml → setup_alerting_tables.py ├── seed_all_alerts_job.yml → seed_all_alerts.py ├── alert_query_validation_job.yml → validate_alert_queries.py ├── notification_destinations_sync_job.yml → sync_notification_destinations.py └── sql_alert_deployment_job.yml → sync_sql_alerts.py Layer 2 (Composite - orchestrates via run_job_task): └── alerting_layer_setup_job.yml → References all atomic jobs
Composite Job Pattern:
tasks:
- task_key: setup_alerting_tables
run_job_task: # ✅ NOT notebook_task!
job_id: ${resources.jobs.alerting_tables_setup_job.id}
- task_key: deploy_sql_alerts
depends_on:
- task_key: validate_alert_queries
run_job_task:
job_id: ${resources.jobs.sql_alert_deployment_job.id}
Benefits: Test atomic jobs independently, debug failures at specific step, run subsets of pipeline.
Minimal Alternative: Two-Job Pattern
For simpler setups, a two-job separation still works:
- •Setup Job (
alert_rules_setup_job): Creates/updates thealert_rulesconfig table - •Deploy Job (
alert_deploy_job): Reads config table and creates/updates SQL Alerts via SDK
Why This Matters:
- •Rules can be modified in Delta without redeploying alerts
- •Dry-run capability for validation
- •Clear separation between configuration and deployment
Principle 3: Fully Qualified Table Names
⚠️ CRITICAL: Databricks SQL Alerts (Public Preview) do NOT support parameters in queries.
-- ❌ WRONG: Parameterized query (NOT SUPPORTED)
SELECT * FROM ${catalog}.${schema}.fact_booking_daily
-- ✅ CORRECT: Fully qualified table names embedded in query
SELECT * FROM wanderbricks_dev.gold.fact_booking_daily
Pattern: Use f-strings at rule creation time to embed catalog/schema:
rev_001_query = f"""
SELECT ...
FROM {catalog}.{gold_schema}.fact_booking_daily
WHERE ...
"""
Principle 4: Severity-Based Notification Routing
Alerts are categorized by severity with different notification strategies:
| Severity | Icon | Action Required | Notification Speed |
|---|---|---|---|
| CRITICAL | 🔴 | Immediate | Real-time (email + Slack) |
| WARNING | 🟡 | Investigate soon | Batched (email) |
| INFO | 🟢 | Informational | Daily digest |
Principle 5: Proactive Query Validation
⚠️ CRITICAL: Run EXPLAIN on all alert queries before deployment to catch column/table errors early.
def validate_alert_query(spark, alert_id: str, query: str) -> tuple:
"""Validate query using EXPLAIN."""
try:
spark.sql(f"EXPLAIN {query}")
return (alert_id, True, None)
except Exception as e:
if "UNRESOLVED_COLUMN" in str(e):
return (alert_id, False, "Column not found")
elif "TABLE_OR_VIEW_NOT_FOUND" in str(e):
return (alert_id, False, "Table not found")
return (alert_id, False, f"Error: {str(e)[:100]}")
Create a dedicated validation job (alert_query_validation_job) that tests all alert queries before deployment. This catches:
- •Missing tables (renamed/dropped)
- •Missing columns (schema evolution)
- •SQL syntax errors
- •Permission issues
Principle 6: Partial Success Tolerance
⚠️ CRITICAL: Allow deployment to succeed if ≥90% of alerts deploy successfully. Don't fail the entire deployment for a single alert issue.
# Allow job to succeed if ≥90% of alerts deploy successfully
if success_rate >= 90:
print(f"⚠️ Partial success: {success_rate:.0f}% ({success_count}/{total})")
else:
raise RuntimeError(f"Too many failures: {len(errors)} errors")
Principle 7: DataFrame-Based Config Seeding
⚠️ CRITICAL: Never use SQL INSERT for seeding alert configurations. Use DataFrame instead.
Problem: SQL INSERT with replace("'", "''") breaks LIKE patterns:
-- Original: WHERE sku_name LIKE '%ALL_PURPOSE%' -- After INSERT escaping: WHERE sku_name LIKE %ALL_PURPOSE% (quotes lost!)
Solution: Use DataFrame with explicit schema:
# ✅ CORRECT: DataFrame handles escaping automatically
rows = [(alert_id, alert_name, alert_query, ...)]
df = spark.createDataFrame(rows, schema)
df.write.mode("append").saveAsTable(cfg_table)
Quick Reference
Alert ID Convention
Format: <DOMAIN>-<NUMBER>-<SEVERITY>
Components:
- •
DOMAIN: Business domain (3-4 chars): REV, ENG, PROP, HOST, CUST, COST, SECURITY, PERF - •
NUMBER: Sequential within domain (3 digits, zero-padded) - •
SEVERITY: CRIT, WARN, or INFO
Examples:
- •
REV-001-CRIT→ Revenue domain, alert #1, critical severity - •
ENG-003-WARN→ Engagement domain, alert #3, warning severity - •
PROP-004-INFO→ Property domain, alert #4, informational - •
COST-001-CRIT→ Cost domain, alert #1, critical severity - •
SECURITY-003-WARN→ Security domain, alert #3, warning - •
PERF-005-INFO→ Performance domain, alert #5, informational
Alert Rules Configuration Table Schema
Required Columns for SDK Deployment:
| Column | SDK Field | Required | Notes |
|---|---|---|---|
alert_query | query_text | ✅ | Full SQL query |
condition_column | AlertOperandColumn.name | ✅ | Column to check |
condition_operator | AlertConditionOperator | ✅ | >, <, =, etc. |
condition_threshold | AlertOperandValue.string_value | ✅ | Threshold value |
schedule_cron | cron_schedule | ✅ | Quartz format |
schedule_timezone | cron_timezone | ✅ | IANA timezone |
See alert-patterns.md for complete schema definition.
SQL Query Patterns
- •Threshold Comparison - Alert when metric crosses threshold
- •Percentage Change from Baseline - Alert when metric deviates from historical average
- •Statistical Anomaly Detection (Z-Score) - Alert when metric is statistically unusual
- •Count-Based Alert - Alert when count is below threshold
- •Informational Summary - Always triggers for daily/weekly summaries
See alert-patterns.md for detailed SQL examples.
SDK Setup (Critical!)
⚠️ This is the #1 operational gotcha. The SDK must be upgraded at runtime for V2 API support:
# Databricks notebook source # MAGIC %pip install --upgrade databricks-sdk>=0.40.0 --quiet # COMMAND ---------- # MAGIC %restart_python # COMMAND ---------- from databricks.sdk import WorkspaceClient from databricks.sdk.service.sql import AlertV2
V2 API Payload Structure
alert_dict = {
"display_name": "[CRITICAL] Alert Name",
"query_text": "SELECT column FROM catalog.schema.table WHERE condition",
"warehouse_id": "warehouse-id",
"schedule": {
"quartz_cron_schedule": "0 0 * * * ?", # Every hour
"timezone_id": "America/Los_Angeles",
"pause_status": "UNPAUSED"
},
"evaluation": {
"source": {"name": "column_name", "aggregation": "SUM"},
"comparison_operator": "GREATER_THAN",
"threshold": {"value": {"double_value": 1000}},
"empty_result_state": "OK",
"notification": {
"notify_on_ok": False,
"subscriptions": [{"user_email": "user@company.com"}]
}
}
}
alert_v2 = AlertV2.from_dict(alert_dict)
ws.alerts_v2.create_alert(alert_v2)
See sdk-api-reference.md for complete V2 API details.
Comparison Operators
| Operator | API Value |
|---|---|
> | GREATER_THAN |
>= | GREATER_THAN_OR_EQUAL |
< | LESS_THAN |
<= | LESS_THAN_OR_EQUAL |
= | EQUAL |
!= | NOT_EQUAL |
IS NULL | IS_NULL |
Aggregation Types
SUM, COUNT, COUNT_DISTINCT, AVG, MEDIAN, MIN, MAX, STDDEV, FIRST (null in API)
Critical Rules
Rule 1: Fully Qualified Table Names Only
❌ WRONG: Parameterized queries (NOT SUPPORTED)
SELECT * FROM ${catalog}.${schema}.fact_booking_daily
✅ CORRECT: Fully qualified names embedded at rule creation
alert_query = f"""
SELECT * FROM {catalog}.{gold_schema}.fact_booking_daily
WHERE check_in_date = DATE_ADD(CURRENT_DATE(), -1)
"""
Rule 2: Query Must Return Rows Only When Condition Met
❌ WRONG: Returns rows always
SELECT rate FROM ... WHERE rate IS NOT NULL
✅ CORRECT: Only returns rows when threshold crossed
SELECT rate FROM ... HAVING rate > 15
Rule 3: Always Include alert_message Column
✅ CORRECT: Human-readable notification content
SELECT
cancellation_rate,
'CRITICAL: Cancellation rate at ' || cancellation_rate || '%' as alert_message
FROM ...
HAVING cancellation_rate > 15
Rule 4: Use NULLIF for Division
✅ CORRECT: Prevent division by zero errors
ROUND(SUM(cancellation_count) / NULLIF(SUM(booking_count), 0) * 100, 1) as cancellation_rate
Rule 5: Use DataFrame for Config Seeding (Never SQL INSERT)
⚠️ CRITICAL: SQL INSERT with string escaping breaks LIKE patterns in alert queries. See Principle 7.
# ❌ WRONG: SQL INSERT loses quotes in LIKE patterns
spark.sql(f"INSERT INTO {table} VALUES ('{alert_query.replace(chr(39), chr(39)+chr(39))}')")
# ✅ CORRECT: DataFrame handles escaping automatically
rows = [(alert_id, alert_name, alert_query, ...)]
df = spark.createDataFrame(rows, schema)
df.write.mode("append").saveAsTable(cfg_table)
Rule 6: No CHECK Constraints or DEFAULT Values in DDL
⚠️ Unity Catalog limitation: CHECK constraints and DEFAULT values are not supported in DDL. Validate in code instead.
# ❌ WRONG: CHECK constraints in DDL (not supported)
# severity STRING NOT NULL CHECK (severity IN ('CRITICAL', 'WARNING', 'INFO'))
# ✅ CORRECT: Validate in code
assert rule["severity"] in ("CRITICAL", "WARNING", "INFO"), f"Invalid severity: {rule['severity']}"
Rule 7: Validate Queries with EXPLAIN Before Deployment
⚠️ CRITICAL: Run EXPLAIN on all alert queries before deployment. See Principle 5.
# ✅ CORRECT: Validate before deploying
for rule in alert_rules:
alert_id, is_valid, error = validate_alert_query(spark, rule["alert_id"], rule["alert_query"])
if not is_valid:
print(f"❌ {alert_id}: {error}")
Core Patterns
Pattern 1: Alert Rules Configuration Table
CREATE TABLE IF NOT EXISTS {catalog}.{gold_schema}.alert_rules (
alert_id STRING NOT NULL,
alert_name STRING NOT NULL,
domain STRING NOT NULL,
severity STRING NOT NULL,
alert_description STRING NOT NULL,
alert_query STRING NOT NULL,
condition_column STRING NOT NULL,
condition_operator STRING NOT NULL,
condition_threshold STRING NOT NULL,
aggregation_type STRING,
schedule_cron STRING NOT NULL,
schedule_timezone STRING NOT NULL,
notification_emails STRING,
notification_slack_channel STRING,
custom_subject_template STRING,
custom_body_template STRING,
notify_on_ok BOOLEAN NOT NULL,
rearm_seconds INT,
is_enabled BOOLEAN NOT NULL,
tags STRING,
owner STRING NOT NULL,
record_created_timestamp TIMESTAMP NOT NULL,
record_updated_timestamp TIMESTAMP NOT NULL,
CONSTRAINT pk_alert_rules PRIMARY KEY (alert_id) NOT ENFORCED
)
USING DELTA
CLUSTER BY AUTO
Pattern 2a: SDK Alert Creation (Typed Classes)
Note: Works with databricks-sdk>=0.28.0. See Pattern 2b for the newer V2 dict-based approach.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import (
AlertCondition,
AlertConditionOperand,
AlertConditionThreshold,
AlertOperandColumn,
AlertOperandValue,
AlertConditionOperator,
)
def create_alert(ws: WorkspaceClient, rule: dict, warehouse_id: str):
"""Create a SQL Alert from a rule configuration."""
# Build condition
condition = AlertCondition(
op=get_operator_enum(rule["condition_operator"]),
operand=AlertConditionOperand(
column=AlertOperandColumn(name=rule["condition_column"])
),
threshold=AlertConditionThreshold(
value=AlertOperandValue(
string_value=str(rule["condition_threshold"])
)
)
)
# Create alert
new_alert = ws.alerts.create(
display_name=f"[{rule['severity']}] {rule['alert_name']}",
query_text=rule["alert_query"],
warehouse_id=warehouse_id,
condition=condition,
cron_schedule=rule["schedule_cron"],
cron_timezone=rule["schedule_timezone"],
notify_on_ok=rule.get("notify_on_ok", False),
custom_subject=rule.get("custom_subject_template"),
custom_body=rule.get("custom_body_template"),
)
return new_alert
Pattern 2b: SDK Alert Creation (V2 Dict-Based)
Note: Requires databricks-sdk>=0.40.0. This is the newer, recommended approach.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.sql import AlertV2
def create_alert_v2(ws: WorkspaceClient, rule: dict, warehouse_id: str):
"""Create a SQL Alert using V2 dict-based API."""
alert_dict = {
"display_name": f"[{rule['severity']}] {rule['alert_name']}",
"query_text": rule["alert_query"],
"warehouse_id": warehouse_id,
"schedule": {
"quartz_cron_schedule": rule["schedule_cron"],
"timezone_id": rule["schedule_timezone"],
"pause_status": "UNPAUSED" if rule["is_enabled"] else "PAUSED"
},
"evaluation": {
"source": {
"name": rule["condition_column"],
"aggregation": rule.get("aggregation_type", "FIRST")
},
"comparison_operator": rule["condition_operator"],
"threshold": {
"value": {"double_value": float(rule["condition_threshold"])}
},
"empty_result_state": "OK",
"notification": {
"notify_on_ok": rule.get("notify_on_ok", False),
"subscriptions": [
{"user_email": email.strip()}
for email in rule.get("notification_emails", "").split(",")
if email.strip()
]
}
}
}
alert_v2 = AlertV2.from_dict(alert_dict)
return ws.alerts_v2.create_alert(alert_v2)
See sdk-api-reference.md for complete V2 API reference including comparison operators, aggregation types, and API gotchas.
Pattern 3: Idempotent Deployment
def deploy_alert(ws: WorkspaceClient, rule: dict, warehouse_id: str,
existing_alerts: dict, dry_run: bool = False):
"""Deploy alert with idempotency check."""
alert_name = f"[{rule['severity']}] {rule['alert_name']}"
existing = existing_alerts.get(alert_name)
if dry_run:
if existing:
print(f"[DRY RUN] Would update: {existing.id}")
else:
print(f"[DRY RUN] Would create new alert")
return {"status": "dry_run"}
if existing:
print(f"Alert exists: {existing.id} - skipping")
return {"status": "skipped", "id": existing.id}
# Create new alert
new_alert = create_alert(ws, rule, warehouse_id)
return {"status": "created", "id": new_alert.id}
Handling RESOURCE_ALREADY_EXISTS
When creating alerts, you may encounter RESOURCE_ALREADY_EXISTS errors. Handle by refreshing the alert list and updating instead:
try:
new_alert = ws.alerts_v2.create_alert(alert_v2)
except Exception as e:
if "RESOURCE_ALREADY_EXISTS" in str(e):
# Refresh alert list and update instead
existing_alerts = {a.display_name: a for a in ws.alerts_v2.list_alerts().alerts}
existing = existing_alerts.get(alert_name)
if existing:
ws.alerts_v2.update_alert(
alert_id=existing.id,
alert=alert_v2,
update_mask="query_text,evaluation,schedule" # ⚠️ Required!
)
else:
raise
Key: PATCH requests require the update_mask parameter specifying which fields to update.
Pattern 4: DAB Job Configuration
Minimal (2-Job) Configuration
Setup Job:
resources:
jobs:
alert_rules_setup_job:
name: "[${bundle.target}] Alert Rules - Setup"
description: "Creates alert_rules configuration table"
environments:
- environment_key: default
spec:
environment_version: "4"
tasks:
- task_key: setup_alert_rules
environment_key: default
notebook_task:
notebook_path: ../../src/alerting/setup_alert_rules.py
base_parameters:
catalog: ${var.catalog}
gold_schema: ${var.gold_schema}
Deploy Job:
resources:
jobs:
alert_deploy_job:
name: "[${bundle.target}] Alert - Deploy"
description: "Deploys SQL alerts from alert_rules configuration table"
environments:
- environment_key: default
spec:
environment_version: "4"
dependencies:
- "databricks-sdk>=0.40.0" # Required for V2 alerts API
tasks:
- task_key: deploy_alerts
environment_key: default
notebook_task:
notebook_path: ../../src/alerting/deploy_alerts.py
base_parameters:
catalog: ${var.catalog}
gold_schema: ${var.gold_schema}
warehouse_id: ${var.warehouse_id}
dry_run: "false"
Production (5+1 Hierarchical) Configuration
Composite Orchestrator Job:
resources:
jobs:
alerting_layer_setup_job:
name: "[${bundle.target}] Alerting Layer - Full Setup"
description: "Orchestrates all alerting setup jobs"
tasks:
- task_key: setup_alerting_tables
run_job_task:
job_id: ${resources.jobs.alerting_tables_setup_job.id}
- task_key: seed_all_alerts
depends_on:
- task_key: setup_alerting_tables
run_job_task:
job_id: ${resources.jobs.seed_all_alerts_job.id}
- task_key: validate_alert_queries
depends_on:
- task_key: seed_all_alerts
run_job_task:
job_id: ${resources.jobs.alert_query_validation_job.id}
- task_key: sync_notification_destinations
depends_on:
- task_key: setup_alerting_tables
run_job_task:
job_id: ${resources.jobs.notification_destinations_sync_job.id}
- task_key: deploy_sql_alerts
depends_on:
- task_key: validate_alert_queries
- task_key: sync_notification_destinations
run_job_task:
job_id: ${resources.jobs.sql_alert_deployment_job.id}
Pattern 5: Proactive Query Validation
def validate_alert_query(spark, alert_id: str, query: str) -> tuple:
"""Validate query using EXPLAIN."""
try:
spark.sql(f"EXPLAIN {query}")
return (alert_id, True, None)
except Exception as e:
if "UNRESOLVED_COLUMN" in str(e):
return (alert_id, False, "Column not found")
elif "TABLE_OR_VIEW_NOT_FOUND" in str(e):
return (alert_id, False, "Table not found")
return (alert_id, False, f"Error: {str(e)[:100]}")
# Usage: validate all rules before deployment
def validate_all_rules(spark, rules: list) -> tuple:
"""Validate all alert queries and return results."""
results = [validate_alert_query(spark, r["alert_id"], r["alert_query"]) for r in rules]
valid = [r for r in results if r[1]]
invalid = [r for r in results if not r[1]]
for alert_id, _, error in invalid:
print(f"❌ {alert_id}: {error}")
print(f"✅ {len(valid)}/{len(results)} queries valid")
return valid, invalid
Pattern 6: Partial Success Deployment
def deploy_all_alerts(ws, rules, warehouse_id, dry_run=False):
"""Deploy all alerts with partial success tolerance."""
existing_alerts = {a.display_name: a for a in ws.alerts_v2.list_alerts().alerts}
results = []
errors = []
for rule in rules:
try:
result = deploy_alert(ws, rule, warehouse_id, existing_alerts, dry_run)
results.append(result)
except Exception as e:
errors.append({"alert_id": rule["alert_id"], "error": str(e)})
success_count = len(results)
total = success_count + len(errors)
success_rate = (success_count / total * 100) if total > 0 else 100
# Allow job to succeed if ≥90% of alerts deploy successfully
if success_rate >= 90:
print(f"⚠️ Partial success: {success_rate:.0f}% ({success_count}/{total})")
else:
raise RuntimeError(f"Too many failures: {len(errors)} errors")
return results, errors
File Structure
src/alerting/ ├── alerting_config.py # Config helpers, dataclasses ├── alerting_metrics.py # Metrics collection ├── setup_alerting_tables.py # Creates Delta config tables ├── seed_all_alerts.py # Seeds alert configs (DataFrame-based) ├── validate_alert_queries.py # EXPLAIN-based query validation ├── sync_notification_destinations.py # Syncs notification channels └── sync_sql_alerts.py # SDK-based sync engine resources/alerting/ ├── alerting_layer_setup_job.yml # Layer 2: Composite orchestrator ├── alerting_tables_setup_job.yml # Layer 1: Atomic ├── seed_all_alerts_job.yml # Layer 1: Atomic ├── alert_query_validation_job.yml # Layer 1: Atomic ├── notification_destinations_sync_job.yml # Layer 1: Atomic └── sql_alert_deployment_job.yml # Layer 1: Atomic
Reference Files
alert-patterns.md
Comprehensive SQL alert patterns including:
- •SQL query patterns (threshold, percentage change, anomaly detection, count-based, informational)
- •Complete alert examples (CRITICAL, WARNING, INFO)
- •Custom notification templates
- •Schedule patterns (Quartz cron)
- •Query design rules
sdk-api-reference.md
Complete SDK and V2 API reference including:
- •V2 API payload structure with field descriptions
- •Comparison operators table
- •Aggregation types reference
- •API gotchas and common pitfalls
- •SDK version requirements
- •SDK setup ceremony for notebooks
Assets
alert-config.yaml
Template configuration file for alert rules with example structure.
Troubleshooting
Problem: Alert Not Triggering
Symptoms: Alert is enabled but never sends notifications
Diagnosis Steps:
- •Run the alert query manually - does it return rows?
- •Check condition: does returned value satisfy operator + threshold?
- •Verify warehouse is running when schedule fires
- •Check notification destination configuration
Fix: Add HAVING clause to filter results:
-- ✅ CORRECT: Only returns rows when threshold crossed SELECT rate FROM ... HAVING rate > 15
Problem: Alert Always Triggering
Symptoms: Getting notifications even when condition shouldn't be met
Fix: Query returns rows even when condition isn't met - missing HAVING clause
Problem: "Query Failed" Error Status
Common Causes:
- •Table doesn't exist (wrong catalog/schema)
- •Column doesn't exist
- •SQL syntax error
- •Warehouse unavailable
Debug: Test query in notebook first:
spark.sql(f"""
{alert_query}
""").display()
Problem: SDK Permission Error
Fix: Ensure service principal or user has:
- •
CAN_MANAGEpermission on SQL Warehouse - •
CAN_CREATEpermission for SQL Alerts - •
CAN_USEon catalog and schema
Problem: RESOURCE_ALREADY_EXISTS on Create
Symptoms: RESOURCE_ALREADY_EXISTS error when creating an alert that was previously deleted or has the same display name.
Fix: Refresh the alert list and update instead of creating:
# Refresh alert list
existing_alerts = {a.display_name: a for a in ws.alerts_v2.list_alerts().alerts}
existing = existing_alerts.get(alert_name)
if existing:
ws.alerts_v2.update_alert(
alert_id=existing.id,
alert=alert_v2,
update_mask="query_text,evaluation,schedule" # ⚠️ Required!
)
Problem: LIKE Pattern Quotes Lost in Config Table
Symptoms: Alert queries with LIKE patterns ('%ALL_PURPOSE%') lose their quotes after being inserted into the config table via SQL INSERT.
Root Cause: replace("'", "''") breaks LIKE patterns during SQL INSERT escaping.
Fix: Use DataFrame-based seeding instead of SQL INSERT:
# ✅ CORRECT: DataFrame handles escaping automatically
rows = [(alert_id, alert_name, alert_query, ...)]
df = spark.createDataFrame(rows, schema)
df.write.mode("append").saveAsTable(cfg_table)
Problem: SDK ImportError for AlertV2
Symptoms: ImportError: cannot import name 'AlertV2' from 'databricks.sdk.service.sql'
Root Cause: SDK version is too old. AlertV2 requires databricks-sdk>=0.40.0.
Fix: Upgrade SDK at runtime:
# MAGIC %pip install --upgrade databricks-sdk>=0.40.0 --quiet # COMMAND ---------- # MAGIC %restart_python
Problem: PATCH Update Fails
Symptoms: Alert update via PATCH returns an error or silently fails.
Root Cause: Missing update_mask parameter. PATCH requests require specifying which fields to update.
Fix: Include update_mask in update calls:
ws.alerts_v2.update_alert(
alert_id=existing.id,
alert=alert_v2,
update_mask="query_text,evaluation,schedule" # ⚠️ Required!
)
Best Practices
✅ DO
- •Use config table for all rules - Never hardcode alert configurations
- •Include
alert_messagecolumn - Human-readable notification content - •Test queries manually first - Verify in notebook before adding to config
- •Use NULLIF for division - Prevent division by zero errors
- •Set appropriate rearm periods - Prevent alert fatigue (1800-3600 seconds)
- •Enable notify_on_ok for critical - Know when issues are resolved
- •Use dry_run for validation - Test deployment without creating alerts
- •Run EXPLAIN on all alert queries before deployment - Catch errors early (see Principle 5)
- •Use DataFrame (not SQL INSERT) for config seeding - Prevent escaping bugs (see Principle 7)
- •Allow partial success (≥90%) for deployment jobs - Don't fail all for one (see Principle 6)
❌ DON'T
- •Don't use parameters in queries - SQL Alerts don't support
${param}syntax - •Don't skip the HAVING clause - Queries should only return rows when alerting
- •Don't set rearm too low - Causes notification spam
- •Don't hardcode credentials - Use WorkspaceClient() for auto-auth
- •Don't skip schema validation - Verify alert_rules table exists before deploying
- •Don't use SQL INSERT for queries with LIKE patterns - Quotes get lost (see Rule 5)
- •Don't use CHECK or DEFAULT in config table DDL - Not supported in Unity Catalog (see Rule 6)
Workflow Summary
Initial Setup (Hierarchical)
# 1. Deploy all alerting infrastructure (composite orchestrator) databricks bundle run alerting_layer_setup_job -t dev # Or run individual atomic jobs (see below)
Initial Setup (Minimal 2-Job)
# 1. Deploy alert rules table
databricks bundle run alert_rules_setup_job -t dev
# 2. Verify rules
# SELECT * FROM {catalog}.{gold_schema}.alert_rules
# 3. Deploy alerts (dry run first)
# Edit dry_run: "true" in job config
databricks bundle run alert_deploy_job -t dev
# 4. Deploy alerts (for real)
# Edit dry_run: "false" in job config
databricks bundle run alert_deploy_job -t dev
Running Individual Jobs
Run atomic jobs independently for debugging or partial updates:
# Just set up tables databricks bundle run alerting_tables_setup_job -t dev # Just seed alert configurations databricks bundle run seed_all_alerts_job -t dev # Just validate queries (after config changes) databricks bundle run alert_query_validation_job -t dev # Just sync notification destinations databricks bundle run notification_destinations_sync_job -t dev # Just deploy alerts (after validation passes) databricks bundle run sql_alert_deployment_job -t dev
Adding New Alerts
- •Add rule to
get_alert_rules()function inseed_all_alerts.py(orsetup_alert_rules.pyfor 2-job pattern) - •Run seed job:
databricks bundle run seed_all_alerts_job -t dev - •Run validation:
databricks bundle run alert_query_validation_job -t dev - •Run deploy job:
databricks bundle run sql_alert_deployment_job -t dev
Modifying Existing Alerts
- •Update rule in config table or seed script
- •Run seed job to update config table
- •Delete existing alert in Databricks UI
- •Run deploy job to recreate with new configuration
Key Learnings
Hard-won operational lessons from production deployment:
- •API Endpoint: Use
/api/2.0/alerts(NOT/api/2.0/sql/alerts-v2) - •SDK Version: Requires
%pip install --upgrade databricks-sdk>=0.40.0+%restart_python - •List Response: V2 API returns
alertskey (NOTresults) - •Update Mask: PATCH requests require
update_maskparameter - •RESOURCE_ALREADY_EXISTS: Handle by refreshing alert list and updating instead
- •Fully Qualified Names: Alerts don't support query parameters
- •⚠️ SQL Escaping: Use DataFrame (not SQL INSERT) for inserting queries - LIKE patterns lose quotes otherwise
- •Proactive Validation: Run EXPLAIN on all queries before deployment to catch column/table errors
- •Hierarchical Jobs: Atomic jobs (single notebook) + Composite jobs (
run_job_task) for production - •Partial Success: Allow ≥90% success rate - don't fail entire deployment for single alert issues
- •Unity Catalog: No CHECK constraints, no DEFAULT values in DDL - validate in code
References
- •Databricks SQL Alerts Documentation
- •Alert Notifications
- •Databricks SDK - Alerts API
- •Databricks SDK - Alerts V2 API
- •V2 API Documentation
- •Quartz Cron Expression Format
Summary
Key Takeaways:
- •Use config-driven approach with Delta table for runtime updates
- •Hierarchical 5+1 job architecture for production (or minimal 2-job for simple setups)
- •Fully qualified table names only (no parameters)
- •Queries must return rows ONLY when condition is met (HAVING clause)
- •Always include
alert_messagecolumn for notifications - •Use NULLIF for division to prevent errors
- •Validate all queries with EXPLAIN before deployment
- •Use DataFrame-based config seeding (never SQL INSERT)
- •Allow partial success (≥90%) for deployment resilience
- •Use V2 dict-based SDK pattern (>=0.40.0) for newest API features
Next Steps:
- •Review alert-patterns.md for SQL query examples
- •Review sdk-api-reference.md for V2 API details
- •Set up alert_rules configuration table
- •Create hierarchical job architecture
- •Test with dry_run before production deployment