AgentSkillsCN

adhoc-exploration-notebooks

Create dual-format ad-hoc exploration notebooks for the Databricks workspace (.py) and local Jupyter (.ipynb), used with Databricks Connect. Use this skill when building data exploration tools, debugging data quality issues, or creating interactive analysis notebooks. Supports widget fallback patterns, helper functions for table discovery, and proper Spark session initialization for both environments.

SKILL.md
--- frontmatter
name: adhoc-exploration-notebooks
description: Create dual-format ad-hoc exploration notebooks for Databricks workspace (.py) and local Jupyter (.ipynb) with Databricks Connect. Use when building data exploration tools, debugging data quality issues, or creating interactive analysis notebooks. Supports widget fallback patterns, helper functions for table discovery, and proper Spark session initialization for both environments.
metadata:
  author: prashanth subrahmanyam
  version: "2.0"
  domain: exploration
  role: utility
  standalone: true
  last_verified: "2026-02-07"
  volatility: low
  upstream_sources: []  # Internal notebook patterns

Ad-Hoc Exploration Notebook Patterns

Overview

Every data product should include exploration notebooks in two formats to support different development workflows: Databricks workspace format (.py) for interactive use in the Databricks UI, and Jupyter format (.ipynb) for local development with Databricks Connect. Both are needed because magic commands (%pip, %sql, dbutils.library.restartPython()) work only in the Databricks workspace, not with Databricks Connect or local execution.

Quick Start (30 minutes)

Goal: Create interactive notebooks for data exploration that work both in Databricks and locally.

What You'll Create:

  1. adhoc_exploration.py — Databricks workspace version (with magic commands)
  2. adhoc_exploration.ipynb — Local Jupyter version (Databricks Connect)
  3. Standard helper functions (list tables, explore, check quality, compare, show properties)

Before You Begin: Fill in the Requirements Template with your catalog, schemas, and key tables.

Fast Track: Databricks Workspace

python
# 1. Upload adhoc_exploration.py to Databricks workspace
# 2. Run setup cells (installs databricks-sdk)
# 3. Use helper functions:

list_tables(catalog, "my_schema")
explore_table(catalog, "my_schema", "fact_sales_daily")
check_data_quality(catalog, "my_schema", "fact_sales_daily")
compare_tables(catalog, "silver_schema", "silver_transactions",
               catalog, "gold_schema", "fact_sales_daily")

Fast Track: Local Jupyter

bash
# 1. Install requirements
pip install -r requirements.txt

# 2. Configure Databricks Connect
databricks configure --profile my_profile

# 3. Launch Jupyter and open adhoc_exploration.ipynb
jupyter lab

Key Differences:

  • Databricks (.py): Uses magic commands (%pip, %sql), dbutils.widgets.text() for params
  • Local (.ipynb): Pure Python, direct variable assignment, requires Databricks Connect
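
For the local .ipynb path, the first setup cell is plain Python: a Databricks Connect session plus direct variable assignment in place of widgets. A minimal sketch (the profile name my_profile is an assumption):

python
# First cell of the local .ipynb -- no magic commands, no dbutils widgets.
# Assumes a Databricks CLI profile named "my_profile" and serverless compute;
# swap .serverless() for .clusterId("...") to target a specific cluster.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().profile("my_profile").getOrCreate()

# Parameters are plain variables instead of widgets.
catalog = "default_catalog"
schema = "my_schema"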

When to Use This Skill

  • Building data exploration tools for Bronze/Silver/Gold layers
  • Debugging data quality issues
  • Creating interactive analysis notebooks
  • Validating schema changes
  • Testing query patterns before production
  • Testing Metric Views and monitoring tables
  • Cross-layer data flow validation (Bronze → Silver → Gold)

Critical Rules

1. Dual-Format Requirement

CRITICAL: Create both .py (Databricks workspace) and .ipynb (local Jupyter) formats.

Why: Magic commands (%pip, %sql, dbutils.library.restartPython()) only work in the Databricks workspace, not with Databricks Connect.

2. Spark Session Initialization

CRITICAL: Must specify either .serverless() or .clusterId() — default will fail:

python
# ✅ CORRECT
spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()
# OR
spark = DatabricksSession.builder.clusterId("cluster-id").getOrCreate()

# ❌ WRONG
spark = DatabricksSession.builder.getOrCreate()  # Exception: Cluster id or serverless required

3. Widget Fallback Pattern

Always use the widget fallback pattern for cross-environment compatibility:

python
# Default values
catalog = "default_catalog"

# Try widgets (Databricks workspace), fall back to defaults
try:
    dbutils.widgets.text("catalog", catalog, "Catalog")
    catalog = dbutils.widgets.get("catalog")
except Exception:
    # Widgets not available (Databricks Connect)
    pass

Quick Reference

File Structure

code
src/exploration/
├── adhoc_exploration.py              # Databricks workspace format
├── adhoc_exploration.ipynb           # Local Jupyter format
├── requirements.txt                  # Python dependencies
├── README.md                         # Full documentation
└── QUICKSTART.md                     # 5-minute setup guide

Standard Helper Functions

Every exploration notebook should include:

  1. Table Discovery: list_tables(catalog_name, schema_name) — List all tables with metadata
  2. Table Exploration: explore_table(catalog_name, schema_name, table_name, limit) — Schema + sample data + row count
  3. Data Quality: check_data_quality(catalog_name, schema_name, table_name) — Null counts, duplicates
  4. Comparison: compare_tables(catalog1, schema1, table1, catalog2, schema2, table2) — Compare across layers
  5. Governance: show_table_properties(catalog_name, schema_name, table_name) — View metadata, tags, TBLPROPERTIES
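
As a rough sketch of the first two helpers (assuming an active spark session; the complete implementations live in the reference file):

python
# Illustrative sketches only -- see references/notebook-patterns.md for the full versions.
def list_tables(catalog_name, schema_name):
    """List all tables in a schema."""
    return spark.sql(f"SHOW TABLES IN {catalog_name}.{schema_name}")

def explore_table(catalog_name, schema_name, table_name, limit=10):
    """Print the schema and row count, then return a sample of rows."""
    df = spark.table(f"{catalog_name}.{schema_name}.{table_name}")
    df.printSchema()
    print(f"Row count: {df.count()}")
    return df.limit(limit)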

See Notebook Patterns for complete implementations.

Core Patterns

Databricks Workspace Format (.py)

Key Features:

  • # Databricks notebook source header
  • # COMMAND ---------- cell separators
  • # MAGIC %md for markdown cells
  • Widget support with fallback pattern
  • Magic commands work (%pip, %sql)
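
Together these conventions give the .py file a skeleton roughly like this sketch (cell contents are illustrative):

python
# Databricks notebook source
# MAGIC %md
# MAGIC # Ad-Hoc Exploration

# COMMAND ----------

# MAGIC %pip install 'databricks-sdk[notebook]'

# COMMAND ----------

dbutils.library.restartPython()

# COMMAND ----------

# Widget fallback pattern, Spark session setup, and helper functions follow.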

See Notebook Patterns for complete template.

Local Jupyter Format (.ipynb)

Key Features:

  • Standard Jupyter notebook format
  • No Databricks-specific headers
  • No magic commands
  • Direct variable assignment (no widgets)
  • Setup instructions in markdown cell

Requirements:

txt
databricks-sdk[notebook]>=0.28.0
pyspark>=3.5.0
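
After creating the session, a quick smoke test confirms the local connection resolves to the expected workspace (a minimal sketch using standard Spark SQL functions):

python
# Connectivity smoke test for the local notebook: shows which catalog and user
# the Databricks Connect session resolves to.
spark.sql("SELECT current_catalog() AS active_catalog, current_user() AS user_name").show()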

Multiple Execution Options

Option 1: Run via Databricks UI

  1. Open the notebook in Databricks workspace
  2. Attach to a cluster
  3. Adjust widget parameters if needed
  4. Run cells interactively

Option 2: Run via Asset Bundle

Critical: Include the .py extension in the notebook path:

yaml
notebook_task:
  notebook_path: ../src/exploration/adhoc_exploration.py  # ✅ Include .py extension
  base_parameters:
    catalog: ${var.catalog}

bash
# Deploy and run
databricks bundle deploy -t dev
databricks bundle run -t dev adhoc_exploration_job

See Notebook Patterns for complete YAML example.

Option 3: Direct Execution via %run

python
# In any Databricks notebook, import and use the functions:
%run ./exploration/adhoc_exploration

# Then use the helper functions
list_tables(catalog, bronze_schema)
df = explore_table(catalog, gold_schema, "fact_sales_daily")

Common Mistakes to Avoid

❌ Mistake 1: Missing .serverless() or .clusterId()

python
# ❌ DON'T: Use default builder
spark = DatabricksSession.builder.getOrCreate()
# Exception: Cluster id or serverless are required

✅ Correct: Specify execution environment

python
spark = DatabricksSession.builder.serverless().profile("profile").getOrCreate()

❌ Mistake 2: Forgetting File Extension in Bundle

yaml
# ❌ DON'T: Omit .py extension
notebook_path: ../src/exploration/adhoc_exploration
# Error: notebook not found

✅ Correct: Include .py extension

yaml
notebook_path: ../src/exploration/adhoc_exploration.py

❌ Mistake 3: Using Magic Commands Locally

python
# ❌ DON'T: Use magic commands
%pip install 'databricks-sdk[notebook]'
# SyntaxError: invalid syntax

✅ Correct: Install packages before running

bash
pip install -r requirements.txt

Validation Checklist

File Structure

  • Both .py and .ipynb formats created
  • requirements.txt with dependencies
  • README.md with full documentation
  • QUICKSTART.md with 5-minute setup

.py File (Databricks Workspace)

  • Has # Databricks notebook source header
  • Uses # COMMAND ---------- cell separators
  • Markdown cells use # MAGIC %md
  • Initializes Spark with .serverless() or .clusterId()
  • Uses widget fallback pattern
  • All helper functions defined

.ipynb File (Local Jupyter)

  • Standard Jupyter notebook format
  • No Databricks-specific headers
  • No magic commands
  • Direct variable assignment (no widgets)
  • Setup instructions in markdown cell
  • Same helper functions as .py

Helper Functions

  • list_tables() — Table discovery
  • explore_table() — Schema + samples
  • check_data_quality() — DQ checks
  • compare_tables() — Layer comparison
  • show_table_properties() — Governance metadata

Reference Files

Notebook Patterns

references/notebook-patterns.md — Complete code patterns for both .py and .ipynb formats, helper function implementations, widget fallback patterns, Spark session initialization, dev/prod configuration, %run import patterns, and Asset Bundle integration.

Analysis Workflows

references/analysis-workflows.md — Step-by-step analysis workflows for data discovery, data quality investigation, schema validation, Metric View testing, monitoring metrics review, and cross-layer comparison.

Tips & Troubleshooting

references/tips-and-troubleshooting.md — Visualization tips, summary statistics, performance analysis, extending notebooks with custom functions and matplotlib/plotly, troubleshooting common errors (table not found, permission denied, slow queries, connection issues), and best practices.

Assets

Exploration Notebook Template

assets/templates/exploration-notebook.py — Starter notebook template with all standard helper functions and widget fallback patterns.

Requirements Template

assets/templates/requirements-template.md — Fill-in-the-blank planning template for catalog configuration, environment settings, tables to explore, DQ focus areas, and cross-layer comparisons.

References

Official Databricks Documentation

Related Patterns