AgentSkillsCN

adhoc-exploration-notebooks

Create dual-format ad-hoc exploration notebooks for the Databricks workspace (.py) and local Jupyter (.ipynb), used with Databricks Connect. Use this skill when building data exploration tools, debugging data quality issues, or creating interactive analysis notebooks. Supports widget fallback patterns, helper functions for table discovery, and proper Spark session initialization for both environments.

SKILL.md
--- frontmatter
name: adhoc-exploration-notebooks
description: Create dual-format ad-hoc exploration notebooks for Databricks workspace (.py) and local Jupyter (.ipynb) with Databricks Connect. Use when building data exploration tools, debugging data quality issues, or creating interactive analysis notebooks. Supports widget fallback patterns, helper functions for table discovery, and proper Spark session initialization for both environments.
metadata:
  author: prashanth subrahmanyam
  version: "2.0"
  domain: exploration
  role: utility
  standalone: true
  last_verified: "2026-02-07"
  volatility: low
  upstream_sources: []  # Internal notebook patterns

Ad-Hoc Exploration Notebook Patterns

Overview

Every data product should include exploration notebooks in two formats to support different development workflows: Databricks workspace format (.py) for interactive use in the Databricks UI, and Jupyter format (.ipynb) for local development with Databricks Connect. Both are needed because magic commands (%pip, %sql, dbutils.library.restartPython()) work only in the Databricks workspace, not with Databricks Connect or local execution.

Quick Start (30 minutes)

Goal: Create interactive notebooks for data exploration that work both in Databricks and locally.

What You'll Create:

  1. adhoc_exploration.py — Databricks workspace version (with magic commands)
  2. adhoc_exploration.ipynb — Local Jupyter version (Databricks Connect)
  3. Standard helper functions (list tables, explore, check quality, compare, show properties)

Before You Begin: Fill in the Requirements Template with your catalog, schemas, and key tables.

Fast Track: Databricks Workspace

python
# 1. Upload adhoc_exploration.py to Databricks workspace
# 2. Run setup cells (installs databricks-sdk)
# 3. Use helper functions:

list_tables(catalog, "my_schema")
explore_table(catalog, "my_schema", "fact_sales_daily")
check_data_quality(catalog, "my_schema", "fact_sales_daily")
compare_tables(catalog, "silver_schema", "silver_transactions",
               catalog, "gold_schema", "fact_sales_daily")

Fast Track: Local Jupyter

bash
# 1. Install requirements
pip install -r requirements.txt

# 2. Configure Databricks Connect
databricks configure --profile my_profile

# 3. Launch Jupyter and open adhoc_exploration.ipynb
jupyter lab

Key Differences:

  • Databricks (.py): Uses magic commands (%pip, %sql), dbutils.widgets.text() for params
  • Local (.ipynb): Pure Python, direct variable assignment, requires Databricks Connect
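
For the local .ipynb path, the first setup cell is plain Python: a Databricks Connect session plus direct variable assignment in place of widgets. A minimal sketch (the profile name my_profile is an assumption):

python
# First cell of the local .ipynb -- no magic commands, no dbutils widgets.
# Assumes a Databricks CLI profile named "my_profile" and serverless compute;
# swap .serverless() for .clusterId("...") to target a specific cluster.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().profile("my_profile").getOrCreate()

# Parameters are plain variables instead of widgets.
catalog = "default_catalog"
schema = "my_schema"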

When to Use This Skill

  • Building data exploration tools for Bronze/Silver/Gold layers
  • Debugging data quality issues
  • Creating interactive analysis notebooks
  • Validating schema changes
  • Testing query patterns before production
  • Testing Metric Views and monitoring tables
  • Cross-layer data flow validation (Bronze → Silver → Gold)

Critical Rules

1. Dual-Format Requirement

CRITICAL: Create both .py (Databricks workspace) and .ipynb (local Jupyter) formats.

Why: Magic commands (%pip, %sql, dbutils.library.restartPython()) only work in the Databricks workspace, not with Databricks Connect.

2. Spark Session Initialization

CRITICAL: Must specify either .serverless() or .clusterId() — default will fail:

python
# ✅ CORRECT
spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()
# OR
spark = DatabricksSession.builder.clusterId("cluster-id").getOrCreate()

# ❌ WRONG
spark = DatabricksSession.builder.getOrCreate()  # Exception: Cluster id or serverless required

3. Widget Fallback Pattern

Always use the widget fallback pattern for cross-environment compatibility:

python
# Default values
catalog = "default_catalog"

# Try widgets (Databricks workspace), fall back to defaults
try:
    dbutils.widgets.text("catalog", catalog, "Catalog")
    catalog = dbutils.widgets.get("catalog")
except Exception:
    # Widgets not available (Databricks Connect)
    pass

Quick Reference

File Structure

code
src/exploration/
├── adhoc_exploration.py              # Databricks workspace format
├── adhoc_exploration.ipynb           # Local Jupyter format
├── requirements.txt                  # Python dependencies
├── README.md                         # Full documentation
└── QUICKSTART.md                     # 5-minute setup guide

Standard Helper Functions

Every exploration notebook should include:

  1. Table Discovery: list_tables(catalog_name, schema_name) — List all tables with metadata
  2. Table Exploration: explore_table(catalog_name, schema_name, table_name, limit) — Schema + sample data + row count
  3. Data Quality: check_data_quality(catalog_name, schema_name, table_name) — Null counts, duplicates
  4. Comparison: compare_tables(catalog1, schema1, table1, catalog2, schema2, table2) — Compare across layers
  5. Governance: show_table_properties(catalog_name, schema_name, table_name) — View metadata, tags, TBLPROPERTIES
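
As a rough sketch of the first two helpers (assuming an active spark session; the complete implementations live in the reference file):

python
# Illustrative sketches only -- see references/notebook-patterns.md for the full versions.
def list_tables(catalog_name, schema_name):
    """List all tables in a schema."""
    return spark.sql(f"SHOW TABLES IN {catalog_name}.{schema_name}")

def explore_table(catalog_name, schema_name, table_name, limit=10):
    """Print the schema and row count, then return a sample of rows."""
    df = spark.table(f"{catalog_name}.{schema_name}.{table_name}")
    df.printSchema()
    print(f"Row count: {df.count()}")
    return df.limit(limit)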

See Notebook Patterns for complete implementations.

Core Patterns

Databricks Workspace Format (.py)

Key Features:

  • # Databricks notebook source header
  • # COMMAND ---------- cell separators
  • # MAGIC %md for markdown cells
  • Widget support with fallback pattern
  • Magic commands work (%pip, %sql)
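
Together these conventions give the .py file a skeleton roughly like this sketch (cell contents are illustrative):

python
# Databricks notebook source
# MAGIC %md
# MAGIC # Ad-Hoc Exploration

# COMMAND ----------

# MAGIC %pip install 'databricks-sdk[notebook]'

# COMMAND ----------

dbutils.library.restartPython()

# COMMAND ----------

# Widget fallback pattern, Spark session setup, and helper functions follow.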

See Notebook Patterns for complete template.

Local Jupyter Format (.ipynb)

Key Features:

  • Standard Jupyter notebook format
  • No Databricks-specific headers
  • No magic commands
  • Direct variable assignment (no widgets)
  • Setup instructions in markdown cell

Requirements:

txt
databricks-sdk[notebook]>=0.28.0
pyspark>=3.5.0
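
After creating the session, a quick smoke test confirms the local connection resolves to the expected workspace (a minimal sketch using standard Spark SQL functions):

python
# Connectivity smoke test for the local notebook: shows which catalog and user
# the Databricks Connect session resolves to.
spark.sql("SELECT current_catalog() AS active_catalog, current_user() AS user_name").show()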

Multiple Execution Options

Option 1: Run via Databricks UI

  1. Open the notebook in Databricks workspace
  2. Attach to a cluster
  3. Adjust widget parameters if needed
  4. Run cells interactively

Option 2: Run via Asset Bundle

Critical: Include the .py extension in the notebook path:

yaml
notebook_task:
  notebook_path: ../src/exploration/adhoc_exploration.py  # ✅ Include .py extension
  base_parameters:
    catalog: ${var.catalog}

bash
# Deploy and run
databricks bundle deploy -t dev
databricks bundle run -t dev adhoc_exploration_job

See Notebook Patterns for complete YAML example.

Option 3: Direct Execution via %run

python
# In any Databricks notebook, import and use the functions:
%run ./exploration/adhoc_exploration

# Then use the helper functions
list_tables(catalog, bronze_schema)
df = explore_table(catalog, gold_schema, "fact_sales_daily")

Common Mistakes to Avoid

❌ Mistake 1: Missing .serverless() or .clusterId()

python
# ❌ DON'T: Use default builder
spark = DatabricksSession.builder.getOrCreate()
# Exception: Cluster id or serverless are required

✅ Correct: Specify execution environment

python
spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()

❌ Mistake 2: Forgetting File Extension in Bundle

yaml
# ❌ DON'T: Omit .py extension
notebook_path: ../src/exploration/adhoc_exploration
# Error: notebook not found

✅ Correct: Include .py extension

yaml
notebook_path: ../src/exploration/adhoc_exploration.py

❌ Mistake 3: Using Magic Commands Locally

python
# ❌ DON'T: Use magic commands
%pip install 'databricks-sdk[notebook]'
# SyntaxError: invalid syntax

✅ Correct: Install packages before running

bash
pip install -r requirements.txt

Validation Checklist

File Structure

  • Both .py and .ipynb formats created
  • requirements.txt with dependencies
  • README.md with full documentation
  • QUICKSTART.md with 5-minute setup

.py File (Databricks Workspace)

  • Has # Databricks notebook source header
  • Uses # COMMAND ---------- cell separators
  • Markdown cells use # MAGIC %md
  • Initializes Spark with .serverless() or .clusterId()
  • Uses widget fallback pattern
  • All helper functions defined

.ipynb File (Local Jupyter)

  • Standard Jupyter notebook format
  • No Databricks-specific headers
  • No magic commands
  • Direct variable assignment (no widgets)
  • Setup instructions in markdown cell
  • Same helper functions as .py

Helper Functions

  • list_tables() — Table discovery
  • explore_table() — Schema + samples
  • check_data_quality() — DQ checks
  • compare_tables() — Layer comparison
  • show_table_properties() — Governance metadata

Reference Files

Notebook Patterns

references/notebook-patterns.md — Complete code patterns for both .py and .ipynb formats, helper function implementations, widget fallback patterns, Spark session initialization, dev/prod configuration, %run import patterns, and Asset Bundle integration.

Analysis Workflows

references/analysis-workflows.md — Step-by-step analysis workflows for data discovery, data quality investigation, schema validation, Metric View testing, monitoring metrics review, and cross-layer comparison.

Tips & Troubleshooting

references/tips-and-troubleshooting.md — Visualization tips, summary statistics, performance analysis, extending notebooks with custom functions and matplotlib/plotly, troubleshooting common errors (table not found, permission denied, slow queries, connection issues), and best practices.

Assets

Exploration Notebook Template

assets/templates/exploration-notebook.py — Starter notebook template with all standard helper functions and widget fallback patterns.

Requirements Template

assets/templates/requirements-template.md — Fill-in-the-blank planning template for catalog configuration, environment settings, tables to explore, DQ focus areas, and cross-layer comparisons.

References

Official Databricks Documentation

Related Patterns