AgentSkillsCN

bronze-layer-setup

为测试与演示创建端到端的青铜层。创建表DDL,使用Faker生成假数据,从现有来源复制数据,并配置资产包作业。涵盖Unity Catalog合规性、变更数据馈送、自动液态聚类,以及治理元数据。在设置青铜层表、创建测试/演示数据、快速原型Medallion架构,或启动新的Databricks项目时使用此功能。对于Faker特有的模式(腐败率、函数签名、提供商示例),请加载faker-data-generation技能。

SKILL.md
--- frontmatter
name: bronze-layer-setup
description: End-to-end Bronze layer creation for testing and demos. Creates table DDLs, generates fake data with Faker, copies from existing sources, and configures Asset Bundle jobs. Covers Unity Catalog compliance, Change Data Feed, automatic liquid clustering, and governance metadata. Use when setting up Bronze layer tables, creating test/demo data, rapid prototyping Medallion Architecture, or bootstrapping a new Databricks project. For Faker-specific patterns (corruption rates, function signatures, provider examples), load the faker-data-generation skill.
metadata:
  author: prashanth subrahmanyam
  version: "2.0"
  domain: bronze
  role: orchestrator
  pipeline_stage: 2
  pipeline_stage_name: bronze
  reads:
    - context/*.csv
  next_stages:
    - silver-layer-setup
  workers:
    - faker-data-generation
  common_dependencies:
    - databricks-asset-bundles
    - databricks-expert-agent
    - databricks-python-imports
    - databricks-table-properties
    - schema-management-patterns
    - naming-tagging-standards
    - databricks-autonomous-operations
  last_verified: "2026-02-07"
  volatility: low
  upstream_sources: []  # Internal setup patterns

Bronze Layer Setup

Create Bronze layer tables with test data for rapid prototyping of Medallion Architecture.

When to Use

  • Setting up Bronze layer tables for a new project
  • Creating test/demo data for Silver/Gold layer development
  • Rapid prototyping of Medallion Architecture
  • Bootstrapping a Databricks project with realistic test data
  • Copying data from existing sources to a new Bronze schema

For Faker-specific patterns (corruption rates, function signatures, provider examples), the faker-data-generation worker skill is loaded at Step 4 via the Mandatory Skill Dependencies table below.

Core Philosophy

The Bronze layer in this approach is optimized for testing, demos, and rapid prototyping:

  • Quick setup with realistic test data
  • Faker data generation as the primary method
  • Unity Catalog compliance (proper governance metadata)
  • Change Data Feed enabled for downstream Silver/Gold testing
  • Automatic liquid clustering for query optimization
  • Flexible data sources (generate, copy, or reference existing)
  • NOT for production ingestion (use separate ingestion pipelines for that)

🔴 Non-Negotiable Defaults (Applied to EVERY Bronze Table and Job)

These defaults are ALWAYS applied. There are NO exceptions, NO overrides, NO alternative options.

DefaultValueApplied WhereNEVER Do This Instead
Serverlessenvironments: block with environment_keyEvery job YAML❌ NEVER define job_clusters: or existing_cluster_id:
Environments V4environment_version: "4"Every job's environments.spec❌ NEVER omit or use older versions
Auto Liquid ClusteringCLUSTER BY AUTOEvery CREATE TABLE DDL❌ NEVER use CLUSTER BY (col1, col2) or PARTITIONED BY
Change Data Feed'delta.enableChangeDataFeed' = 'true'Every table's TBLPROPERTIES❌ NEVER omit (required for Silver streaming)
Auto-Optimize'delta.autoOptimize.optimizeWrite' = 'true'Every table's TBLPROPERTIES❌ NEVER omit
notebook_tasknotebook_task: with base_parameters:Every task in job YAML❌ NEVER use python_task: or CLI-style parameters:
sql
-- ✅ CORRECT: Every Bronze table DDL MUST include
CREATE TABLE IF NOT EXISTS {catalog}.{schema}.{table_name} (
    ...
)
USING DELTA
CLUSTER BY AUTO          -- 🔴 MANDATORY
TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'true',          -- 🔴 MANDATORY
    'delta.autoOptimize.optimizeWrite' = 'true',    -- 🔴 MANDATORY
    'delta.autoOptimize.autoCompact' = 'true',      -- 🔴 MANDATORY
    'layer' = 'bronze'
)
yaml
# ✅ CORRECT: Every Bronze job MUST include
environments:
  - environment_key: "default"
    spec:
      environment_version: "4"   # 🔴 MANDATORY
tasks:
  - task_key: setup_tables
    environment_key: default     # 🔴 MANDATORY on every task
    notebook_task:               # 🔴 MANDATORY (never python_task)
      notebook_path: ../src/setup_tables.py
      base_parameters:           # 🔴 MANDATORY (never CLI-style parameters)
        catalog: ${var.catalog}

Quick Start (30 minutes)

What You'll Create:

  1. setup_tables.py - DDL definitions for all Bronze tables
  2. generate_dimensions.py - Faker-based dimension data generator
  3. generate_facts.py - Faker-based fact data generator (with FK integrity)
  4. bronze_setup_job.yml + bronze_data_generator_job.yml - Asset Bundle jobs

Fast Track:

bash
# 1. Deploy setup job
databricks bundle deploy -t dev

# 2. Create tables
databricks bundle run bronze_setup_job -t dev

# 3. Generate data (dimensions -> facts in sequence)
databricks bundle run bronze_data_generator_job -t dev

Key Decisions:

  • Data Source: Faker (recommended) | Existing tables | External copy
  • Record Counts: Dimensions: 100-200 | Facts: 1,000-10,000
  • Tables Needed: 5-10 tables (dimensions + facts)

Output: Bronze Delta tables with Change Data Feed enabled, ready for Silver layer testing

Workflow

Step 1: Gather Requirements (15 min)

Fill in the requirements template: references/requirements-template.md

  • Project name, entity list (5-10 tables), data source approach
  • Domain taxonomy, data classification, record counts
  • Business/technical ownership

Step 2: Choose Data Source Approach

Three approaches detailed in references/data-source-approaches.md:

ApproachBest ForTime
A: Schema CSV + Faker (recommended)Create Bronze tables matching customer's source schema from context/*.csv, then populate with Faker-generated data30-45 min
B: Existing tablesData already exists in Databricks — read it directly15-20 min
C: External copyCopy data from sample datasets, CSVs, databases, or other workspaces20-30 min

Approach A (Schema CSV + Faker) is the standard approach for this framework. It reads the customer's source schema CSV from context/ to create Bronze DDLs with the exact same table structure, then generates Faker data matching those column types and FK relationships. This ensures Bronze faithfully represents the customer's source system.

python
# ✅ CORRECT: Read schema CSV to build Bronze DDLs
import csv
from pathlib import Path

def extract_tables_from_schema_csv(csv_path: Path) -> dict:
    """Extract table definitions from customer schema CSV."""
    from collections import defaultdict
    tables = defaultdict(list)
    with open(csv_path) as f:
        reader = csv.DictReader(f)
        for row in reader:
            tables[row["table_name"]].append({
                "name": row["column_name"],
                "type": row.get("full_data_type", row.get("data_type", "STRING")),
                "nullable": row.get("is_nullable", "YES") == "YES",
                "comment": row.get("comment", ""),
            })
    return dict(tables)

# Extract tables from customer schema
schema_tables = extract_tables_from_schema_csv(Path("context/Wanderbricks_Schema.csv"))

# Generate DDL for each table — names and types come from CSV, never hardcoded
for table_name, columns in schema_tables.items():
    col_defs = ", ".join(f"{c['name']} {c['type']}" for c in columns)
    ddl = f"CREATE TABLE IF NOT EXISTS {{catalog}}.{{schema}}.{table_name} ({col_defs}) ..."
python
# ❌ WRONG: Hardcoding table definitions instead of extracting from schema CSV
tables = {
    "bookings": ["booking_id BIGINT", "user_id BIGINT", ...],  # ❌ Might be incomplete
    "users": ["user_id BIGINT", "email STRING", ...],           # ❌ Might have wrong types
}

Step 3: Create Table DDLs (30 min)

Use the setup script template: scripts/setup_tables.py

File structure to create:

code
src/{project}_bronze/
├── __init__.py                # Package initialization
├── setup_tables.py            # Table DDL definitions
├── generate_dimensions.py     # Generate dimension data with Faker
├── generate_facts.py          # Generate fact data with Faker
└── copy_from_source.py        # Optional: Copy from existing source

Critical DDL rules:

  • CLUSTER BY AUTO on all tables (never specify columns manually)
  • delta.enableChangeDataFeed = true (required for Silver)
  • Standard audit columns: ingestion_timestamp, source_file
  • Mark tables as data_purpose = testing_demo, is_production = false

Step 4: Generate or Load Data (30-45 min)

Option A (Faker): Use the faker-data-generation skill for patterns.

  • Generate dimensions first (for FK integrity)
  • Generate facts with references to dimension keys
  • Use seeded Faker for reproducibility

Option B/C (Copy): Use the copy script template: scripts/copy_from_source.py

Step 5: Configure Asset Bundle Jobs (15 min)

Use the job templates:

Step 6: Deploy & Validate (15 min)

Run validation queries: references/validation-queries.md

Critical Rules

Required TBLPROPERTIES

Every Bronze table must include:

sql
TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'true',
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact' = 'true',
    'layer' = 'bronze',
    'source_system' = '{source}',
    'domain' = '{domain}',
    'entity_type' = '{dimension|fact}',
    'contains_pii' = '{true|false}',
    'data_classification' = '{confidential|internal|public}',
    'business_owner' = '{team}',
    'technical_owner' = 'Data Engineering',
    'data_purpose' = 'testing_demo',
    'is_production' = 'false'
)

Table Naming Convention

  • Dimensions: bronze_{entity}_dim (e.g., bronze_store_dim, bronze_product_dim)
  • Facts: bronze_{entity} (e.g., bronze_transactions, bronze_inventory)
  • Date dimension: bronze_date_dim (SQL-generated, not Faker)

Data Generation Order

  1. Dimensions first - Create master data tables
  2. Date dimension - Generated via SQL SEQUENCE (not Faker)
  3. Facts last - Load dimension keys for FK integrity

Mandatory Skill Dependencies

MANDATORY: Read each skill below using the Read tool BEFORE writing any code for the indicated step. Do NOT generate these patterns from memory.

StepRead Skill (MANDATORY)What It Provides
All stepsdata_product_accelerator/skills/common/databricks-expert-agent/SKILL.mdCore extraction principle: extract names from source, never hardcode
Step 3 (DDLs)data_product_accelerator/skills/common/databricks-table-properties/SKILL.mdBronze TBLPROPERTIES, CLUSTER BY AUTO, governance metadata
Step 3 (DDLs)data_product_accelerator/skills/common/schema-management-patterns/SKILL.mdCREATE SCHEMA IF NOT EXISTS, Predictive Optimization
Step 4 (Data)data_product_accelerator/skills/bronze/01-faker-data-generation/SKILL.mdFaker corruption patterns, function signatures, provider examples
Step 5 (Jobs)data_product_accelerator/skills/common/databricks-asset-bundles/SKILL.mdServerless job YAML, Environments V4, notebook_task, base_parameters
Step 5 (Jobs)data_product_accelerator/skills/common/databricks-python-imports/SKILL.mdPure Python import patterns for notebook code sharing
Troubleshootingdata_product_accelerator/skills/common/databricks-autonomous-operations/SKILL.mdDeploy → Poll → Diagnose → Fix → Redeploy loop when jobs fail

NEVER do these without FIRST reading the corresponding skill:

  • NEVER write TBLPROPERTIES without reading databricks-table-properties
  • NEVER write Faker generators without reading faker-data-generation
  • NEVER write Asset Bundle YAML without reading databricks-asset-bundles
  • NEVER write CREATE SCHEMA without reading schema-management-patterns

Reference Files

Scripts

Asset Templates

Pipeline Progression

Previous stage: gold/00-gold-layer-design → Gold layer design must be complete so that the target dimensional model is understood before creating Bronze tables

Next stage: After completing the Bronze layer, proceed to:

  • silver/00-silver-layer-setup — Set up Silver layer DLT pipelines with data quality rules

References