Bronze Layer Setup
Create Bronze layer tables with test data for rapid prototyping of Medallion Architecture.
When to Use
- •Setting up Bronze layer tables for a new project
- •Creating test/demo data for Silver/Gold layer development
- •Rapid prototyping of Medallion Architecture
- •Bootstrapping a Databricks project with realistic test data
- •Copying data from existing sources to a new Bronze schema
For Faker-specific patterns (corruption rates, function signatures, provider examples), the faker-data-generation worker skill is loaded at Step 4 via the Mandatory Skill Dependencies table below.
Core Philosophy
The Bronze layer in this approach is optimized for testing, demos, and rapid prototyping:
- •Quick setup with realistic test data
- •Faker data generation as the primary method
- •Unity Catalog compliance (proper governance metadata)
- •Change Data Feed enabled for downstream Silver/Gold testing
- •Automatic liquid clustering for query optimization
- •Flexible data sources (generate, copy, or reference existing)
- •NOT for production ingestion (use separate ingestion pipelines for that)
🔴 Non-Negotiable Defaults (Applied to EVERY Bronze Table and Job)
These defaults are ALWAYS applied. There are NO exceptions, NO overrides, NO alternative options.
| Default | Value | Applied Where | NEVER Do This Instead |
|---|---|---|---|
| Serverless | environments: block with environment_key | Every job YAML | ❌ NEVER define job_clusters: or existing_cluster_id: |
| Environments V4 | environment_version: "4" | Every job's environments.spec | ❌ NEVER omit or use older versions |
| Auto Liquid Clustering | CLUSTER BY AUTO | Every CREATE TABLE DDL | ❌ NEVER use CLUSTER BY (col1, col2) or PARTITIONED BY |
| Change Data Feed | 'delta.enableChangeDataFeed' = 'true' | Every table's TBLPROPERTIES | ❌ NEVER omit (required for Silver streaming) |
| Auto-Optimize | 'delta.autoOptimize.optimizeWrite' = 'true' | Every table's TBLPROPERTIES | ❌ NEVER omit |
| notebook_task | notebook_task: with base_parameters: | Every task in job YAML | ❌ NEVER use python_task: or CLI-style parameters: |
-- ✅ CORRECT: Every Bronze table DDL MUST include
CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table_name} (
...
)
USING DELTA
CLUSTER BY AUTO -- 🔴 MANDATORY
TBLPROPERTIES (
'delta.enableChangeDataFeed' = 'true', -- 🔴 MANDATORY
'delta.autoOptimize.optimizeWrite' = 'true', -- 🔴 MANDATORY
'delta.autoOptimize.autoCompact' = 'true', -- 🔴 MANDATORY
'layer' = 'bronze'
)
# ✅ CORRECT: Every Bronze job MUST include
environments:
- environment_key: "default"
spec:
environment_version: "4" # 🔴 MANDATORY
tasks:
- task_key: setup_tables
environment_key: default # 🔴 MANDATORY on every task
notebook_task: # 🔴 MANDATORY (never python_task)
notebook_path: ../src/setup_tables.py
base_parameters: # 🔴 MANDATORY (never CLI-style parameters)
catalog: ${var.catalog}
Quick Start (30 minutes)
What You'll Create:
- •
setup_tables.py- DDL definitions for all Bronze tables - •
generate_dimensions.py- Faker-based dimension data generator - •
generate_facts.py- Faker-based fact data generator (with FK integrity) - •
bronze_setup_job.yml+bronze_data_generator_job.yml- Asset Bundle jobs
Fast Track:
# 1. Deploy setup job databricks bundle deploy -t dev # 2. Create tables databricks bundle run bronze_setup_job -t dev # 3. Generate data (dimensions -> facts in sequence) databricks bundle run bronze_data_generator_job -t dev
Key Decisions:
- •Data Source: Faker (recommended) | Existing tables | External copy
- •Record Counts: Dimensions: 100-200 | Facts: 1,000-10,000
- •Tables Needed: 5-10 tables (dimensions + facts)
Output: Bronze Delta tables with Change Data Feed enabled, ready for Silver layer testing
Workflow
Step 1: Gather Requirements (15 min)
Fill in the requirements template: references/requirements-template.md
- •Project name, entity list (5-10 tables), data source approach
- •Domain taxonomy, data classification, record counts
- •Business/technical ownership
Step 2: Choose Data Source Approach
Three approaches detailed in references/data-source-approaches.md:
| Approach | Best For | Time |
|---|---|---|
| A: Schema CSV + Faker (recommended) | Create Bronze tables matching customer's source schema from context/*.csv, then populate with Faker-generated data | 30-45 min |
| B: Existing tables | Data already exists in Databricks — read it directly | 15-20 min |
| C: External copy | Copy data from sample datasets, CSVs, databases, or other workspaces | 20-30 min |
Approach A (Schema CSV + Faker) is the standard approach for this framework. It reads the customer's source schema CSV from context/ to create Bronze DDLs with the exact same table structure, then generates Faker data matching those column types and FK relationships. This ensures Bronze faithfully represents the customer's source system.
# ✅ CORRECT: Read schema CSV to build Bronze DDLs
import csv
from pathlib import Path
def extract_tables_from_schema_csv(csv_path: Path) -> dict:
"""Extract table definitions from customer schema CSV."""
from collections import defaultdict
tables = defaultdict(list)
with open(csv_path) as f:
reader = csv.DictReader(f)
for row in reader:
tables[row["table_name"]].append({
"name": row["column_name"],
"type": row.get("full_data_type", row.get("data_type", "STRING")),
"nullable": row.get("is_nullable", "YES") == "YES",
"comment": row.get("comment", ""),
})
return dict(tables)
# Extract tables from customer schema
schema_tables = extract_tables_from_schema_csv(Path("context/Wanderbricks_Schema.csv"))
# Generate DDL for each table — names and types come from CSV, never hardcoded
for table_name, columns in schema_tables.items():
col_defs = ", ".join(f"{c['name']} {c['type']}" for c in columns)
ddl = f"CREATE TABLE IF NOT EXISTS {{catalog}}.{{schema}}.{table_name} ({col_defs}) ..."
# ❌ WRONG: Hardcoding table definitions instead of extracting from schema CSV
tables = {
"bookings": ["booking_id BIGINT", "user_id BIGINT", ...], # ❌ Might be incomplete
"users": ["user_id BIGINT", "email STRING", ...], # ❌ Might have wrong types
}
Step 3: Create Table DDLs (30 min)
Use the setup script template: scripts/setup_tables.py
File structure to create:
src/{project}_bronze/
├── __init__.py # Package initialization
├── setup_tables.py # Table DDL definitions
├── generate_dimensions.py # Generate dimension data with Faker
├── generate_facts.py # Generate fact data with Faker
└── copy_from_source.py # Optional: Copy from existing source
Critical DDL rules:
- •
CLUSTER BY AUTOon all tables (never specify columns manually) - •
delta.enableChangeDataFeed = true(required for Silver) - •Standard audit columns:
ingestion_timestamp,source_file - •Mark tables as
data_purpose = testing_demo,is_production = false
Step 4: Generate or Load Data (30-45 min)
Option A (Faker): Use the faker-data-generation skill for patterns.
- •Generate dimensions first (for FK integrity)
- •Generate facts with references to dimension keys
- •Use seeded Faker for reproducibility
Option B/C (Copy): Use the copy script template: scripts/copy_from_source.py
Step 5: Configure Asset Bundle Jobs (15 min)
Use the job templates:
- •assets/templates/bronze-setup-job.yaml - Table creation job
- •assets/templates/bronze-data-generator-job.yaml - Data generation job
Step 6: Deploy & Validate (15 min)
Run validation queries: references/validation-queries.md
Critical Rules
Required TBLPROPERTIES
Every Bronze table must include:
TBLPROPERTIES (
'delta.enableChangeDataFeed' = 'true',
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'layer' = 'bronze',
'source_system' = '{source}',
'domain' = '{domain}',
'entity_type' = '{dimension|fact}',
'contains_pii' = '{true|false}',
'data_classification' = '{confidential|internal|public}',
'business_owner' = '{team}',
'technical_owner' = 'Data Engineering',
'data_purpose' = 'testing_demo',
'is_production' = 'false'
)
Table Naming Convention
- •Dimensions:
bronze_{entity}_dim(e.g.,bronze_store_dim,bronze_product_dim) - •Facts:
bronze_{entity}(e.g.,bronze_transactions,bronze_inventory) - •Date dimension:
bronze_date_dim(SQL-generated, not Faker)
Data Generation Order
- •Dimensions first - Create master data tables
- •Date dimension - Generated via SQL SEQUENCE (not Faker)
- •Facts last - Load dimension keys for FK integrity
Mandatory Skill Dependencies
MANDATORY: Read each skill below using the Read tool BEFORE writing any code for the indicated step. Do NOT generate these patterns from memory.
| Step | Read Skill (MANDATORY) | What It Provides |
|---|---|---|
| All steps | data_product_accelerator/skills/common/databricks-expert-agent/SKILL.md | Core extraction principle: extract names from source, never hardcode |
| Step 3 (DDLs) | data_product_accelerator/skills/common/databricks-table-properties/SKILL.md | Bronze TBLPROPERTIES, CLUSTER BY AUTO, governance metadata |
| Step 3 (DDLs) | data_product_accelerator/skills/common/schema-management-patterns/SKILL.md | CREATE SCHEMA IF NOT EXISTS, Predictive Optimization |
| Step 4 (Data) | data_product_accelerator/skills/bronze/01-faker-data-generation/SKILL.md | Faker corruption patterns, function signatures, provider examples |
| Step 5 (Jobs) | data_product_accelerator/skills/common/databricks-asset-bundles/SKILL.md | Serverless job YAML, Environments V4, notebook_task, base_parameters |
| Step 5 (Jobs) | data_product_accelerator/skills/common/databricks-python-imports/SKILL.md | Pure Python import patterns for notebook code sharing |
| Troubleshooting | data_product_accelerator/skills/common/databricks-autonomous-operations/SKILL.md | Deploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail |
NEVER do these without FIRST reading the corresponding skill:
- •NEVER write
TBLPROPERTIESwithout readingdatabricks-table-properties - •NEVER write Faker generators without reading
faker-data-generation - •NEVER write Asset Bundle YAML without reading
databricks-asset-bundles - •NEVER write
CREATE SCHEMAwithout readingschema-management-patterns
Reference Files
- •references/requirements-template.md - Fill-in template for project requirements, entity list, data classification, ownership
- •references/data-source-approaches.md - Detailed patterns for all 3 data source approaches (Faker, Existing, Copy)
- •references/validation-queries.md - Validation SQL queries and implementation checklist
Scripts
- •scripts/setup_tables.py - Template table creation notebook with DDL patterns, audit columns, governance metadata
- •scripts/copy_from_source.py - Template copy-from-source notebook for Approach B/C
Asset Templates
- •assets/templates/bronze-setup-job.yaml - Asset Bundle job for table creation
- •assets/templates/bronze-data-generator-job.yaml - Asset Bundle job for Faker data generation with dependency chain
Pipeline Progression
Previous stage: gold/00-gold-layer-design → Gold layer design must be complete so that the target dimensional model is understood before creating Bronze tables
Next stage: After completing the Bronze layer, proceed to:
- •
silver/00-silver-layer-setup— Set up Silver layer DLT pipelines with data quality rules