Gold Layer Schema Validation Patterns
Overview
Gold layer merge scripts must align perfectly with DDL-defined schemas. Mismatches between source DataFrame columns and target table schemas cause 59% of Gold layer bugs. This skill provides validation patterns to catch schema issues before deployment.
Key Principle: DDL (setup script) is the runtime source of truth, not YAML design specifications.
When to Use This Skill
Use this skill when:
- •Creating Gold layer merge scripts from Silver to Gold
- •Ensuring DataFrame columns match target table schemas
- •Validating column mappings before merge operations
- •Catching schema issues before deployment
- •Debugging
UNRESOLVED_COLUMNerrors in merge operations
Core Problem: Three Sources of Truth
Gold layer development involves three schema definitions:
- •YAML Design (
gold_layer_design/yaml/*.yaml) - Human-readable specifications - •DDL Scripts (
src/gold/setup/*.py) - CREATE TABLE statements - •Merge Scripts (
src/gold/merge/*.py) - DataFrame transformations
Problem: Manual transcription between these creates mismatches.
Common Failure Pattern
# YAML Design: user_email
# DDL Script: email (renamed during implementation)
# Merge Script: .select("user_email") # ❌ Still references old name!
# Runtime Error: [UNRESOLVED_COLUMN]
Root Cause: DDL was updated, but merge script wasn't.
Critical Rules
Rule 1: DDL is Runtime Source of Truth
Always develop merge scripts against actual created tables, not YAML designs.
Workflow:
1. YAML Design (Planning) ↓ 2. DDL Creation (Reality) ← SOURCE OF TRUTH ↓ 3. Run Setup Job (Create Tables) ↓ 4. Validate Schema (Read actual table) ↓ 5. Write Merge Script (Against actual schema) ↓ 6. Pre-Deployment Validation (Compare DataFrame vs DDL)
Rule 2: Always Validate Before Merge
Use schema validation helper before every merge operation:
from helpers import validate_merge_schema
# Prepare updates
updates_df = prepare_transformations(silver_df)
# ✅ Validate schema BEFORE merge
validate_merge_schema(
spark, updates_df, catalog, gold_schema, "dim_user",
raise_on_mismatch=True # Fail fast if mismatch
)
# Proceed with merge
delta_gold.merge(updates_df, ...)
Rule 3: Explicit Column Mapping
Never assume Silver column names match Gold DDL. Always use explicit mapping:
# ✅ CORRECT: Explicit mapping
updates_df = (
silver_df
.withColumn("store_id", col("store_number")) # Rename explicitly
.withColumn("product_id", col("upc_code")) # Rename explicitly
.select("store_id", "product_id", ...) # Only Gold columns
)
Quick Reference
Schema Validation Helper
validate_merge_schema(
spark: SparkSession,
updates_df: DataFrame,
catalog: str,
schema: str,
table_name: str,
raise_on_mismatch: bool = True
) -> dict
Returns validation result with:
- •
valid: Boolean indicating if schemas match - •
missing_in_df: Columns in DDL but not in DataFrame - •
extra_in_df: Columns in DataFrame but not in DDL - •
type_mismatches: Columns with type differences
See references/validation-patterns.md for complete implementation with full code examples.
Pre-Merge Validation Checklist
Before writing any Gold merge script:
Pre-Development:
- • Read YAML design to understand intent
- • Run setup job to create actual tables
- • Read DDL schema from actual table (
DESCRIBE table) - • Document column mappings (Silver → Gold)
- • Identify any renamed columns
During Development:
- • Use explicit
.withColumn()for all renames - • Use
.select()with explicit column list (no*) - • Add comments documenting mappings
- • Cast data types explicitly if DDL differs from source
- • Include all DDL columns in final DataFrame
Pre-Deployment:
- • Run schema validation helper
- • Verify no extra columns in DataFrame
- • Verify no missing columns from DDL
- • Verify data types match DDL
- • Test merge with small sample data
Core Patterns
Pattern 1: Schema Inspector Helper
Validate DataFrame against target table schema before merge:
# See references/validation-patterns.md for full implementation from helpers import validate_merge_schema validate_merge_schema(spark, updates_df, catalog, schema, table_name)
Benefits:
- •✅ Catches mismatches before Delta merge attempt
- •✅ Clear error messages with exact column differences
- •✅ Prevents cryptic Delta errors
- •✅ Validates data types, not just column names
Pattern 2: Explicit Column Mapping
Document and implement explicit Silver → Gold column mappings:
def merge_dim_store(spark, catalog, silver_schema, gold_schema):
"""
Column Mapping (Silver → Gold):
- store_number → store_number (same)
- company_rcn → company_retail_control_number (renamed)
- processed_timestamp → record_updated_timestamp (renamed)
"""
updates_df = (
silver_df
.withColumn("company_retail_control_number", col("company_rcn"))
.withColumn("record_updated_timestamp", col("processed_timestamp"))
.select("store_number", "company_retail_control_number", ...)
)
Pattern 3: DDL Schema Reader
Read actual table schema before writing merge script:
# Step 1: Read actual schema
schema = spark.table(f"{catalog}.{gold_schema}.dim_user").schema
print([field.name for field in schema.fields]) # Verify column names
# Step 2: Write merge script using actual columns
updates_df = df.select("email") # ✅ Matches DDL, not YAML
Common Mistakes to Avoid
❌ Mistake 1: Trusting YAML Over DDL
# ❌ BAD: Using YAML-defined column name
updates_df = df.select("user_email") # Doesn't match DDL!
Fix: Always verify against actual DDL:
# ✅ GOOD: Check actual table schema first
schema = spark.table("catalog.schema.dim_user").schema
updates_df = df.select("email") # ✅ Matches DDL
❌ Mistake 2: Implicit Column Selection
# ❌ BAD: Selecting with *
updates_df = silver_df.select("*") # Includes all Silver columns
Fix: Explicit selection:
# ✅ GOOD: Only select columns that exist in Gold DDL
updates_df = silver_df.select("column1", "column2", "column3")
❌ Mistake 3: No Pre-Merge Validation
# ❌ BAD: Merge without validation delta_gold.merge(updates_df, ...) # Fails at runtime with cryptic error
Fix: Validate first:
# ✅ GOOD: Validate before merge validate_merge_schema(spark, updates_df, catalog, schema, table_name) delta_gold.merge(updates_df, ...) # Clear error if mismatch
Reference Files
- •
references/validation-patterns.md- Complete validation SQL patterns, schema inspector implementation, DDL reader patterns, column mapping documentation patterns, and detailed validation workflows - •
scripts/validate_schema.py- Pre-deployment validation script for comparing all merge DataFrames against DDL schemas
Scripts
validate_schema.py
Pre-deployment validation script that compares all merge DataFrames against DDL schemas. Run before databricks bundle deploy to catch schema mismatches.
Usage:
python scripts/validate_schema.py \ prashanth_subrahmanyam_catalog \ dev_prashanth_subrahmanyam_system_gold
See scripts/validate_schema.py for complete implementation.
Related Patterns
- •Fact Table Grain Validation - See
gold/fact-table-grain-validationskill - •Unity Catalog Constraints - See
common/unity-catalog-constraintsskill - •Gold Merge Deduplication - See
gold/gold-delta-merge-deduplicationskill
References
- •Delta Lake Merge - Official documentation
- •Rule Improvement Case Study - Complete Gold layer debugging methodology
Last Updated: December 2, 2025
Pattern Origin: 88 bugs across 7 Gold domains, 26% schema mismatch errors
Key Lesson: DDL is runtime truth, not YAML. Always validate DataFrame against actual table schema before merge.
Impact: Prevents 48% of schema-related bugs through pre-merge validation