Faker Data Generation Patterns

Overview

When generating synthetic data for Databricks Bronze layer tables, use Faker with configurable data corruption to test Silver layer data quality expectations.

When to Use This Skill

Use when:

•Creating test data for data quality validation
•Testing DLT expectations with intentional violations
•Simulating production-like datasets for development/staging
•Validating referential integrity between dimensions and facts

Core Principles

•Realistic Data: Use Faker with non-linear distributions and temporal patterns
•Referential Integrity: Maintain proper FK relationships between dimensions and facts
•Configurable Corruption: Add intentional data quality issues for testing
•DQ Mapping: Each corruption type maps to specific DLT expectations
•Row Coherence: Attributes within a row must correlate logically
•Raw Data Only: Generate transactional records -- aggregation happens in Gold
•Reproducible: Always seed np.random.seed() and Faker.seed(), plus random.seed() whenever the script calls the standard-library random module
•Documentation: Document corruption patterns and their DQ impacts

Critical Rules

Standard Function Signature

python

import random

from faker import Faker


def generate_<entity>_data(
    dimension_keys: dict,
    num_records: int = 1000,
    corruption_rate: float = 0.05
) -> list:
    """
    Generate fake <entity> data with realistic patterns.
    
    Args:
        dimension_keys: Dictionary containing dimension keys for referential integrity
        num_records: Number of records to generate
        corruption_rate: Fraction of records to intentionally corrupt (0.0 to 1.0)
        
    Returns:
        List of <entity> dictionaries
    """
    fake = Faker()
    records = []
    
    print(f"\nGenerating {num_records} <entities> (corruption rate: {corruption_rate:.1%})")
    
    for _ in range(num_records):
        # Generate valid data first
        record_data = generate_valid_record(fake, dimension_keys)
        
        # Apply corruption if selected
        should_corrupt = random.random() < corruption_rate
        
        if should_corrupt:
            record_data = apply_corruption(record_data, corruption_rate)
        
        records.append(record_data)
    
    return records

🔴 MANDATORY: Seed for Reproducibility

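EVERY generation script MUST seed numpy and Faker. Also seed the standard-library random module whenever the script calls it (the corruption examples here use random.random()), because Faker.seed() does not seed it:

python

import random

import numpy as np
from faker import Faker

SEED = 42
random.seed(SEED)      # stdlib RNG behind random.random() / random.choice()
np.random.seed(SEED)   # NumPy global RNG behind np.random.* calls
Faker.seed(SEED)       # Faker's shared RNG
fake = Faker()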

Why: Without seeding, re-running generation produces different data, making debugging impossible and breaking snapshot tests.

🔴 MANDATORY: Non-Linear Distributions

NEVER use random.uniform() for values. Real-world values (prices, durations, category frequencies) are rarely uniformly distributed:

python

# ❌ WRONG - Uniform (unrealistic)
prices = [random.uniform(10, 1000) for _ in range(N)]

# ✅ CORRECT - Log-normal for monetary values (prices, salaries, amounts)
prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N)

# ✅ CORRECT - Exponential for durations (resolution time, session length)
durations = np.random.exponential(scale=24, size=N)

# ✅ CORRECT - Weighted categorical (not equal probability)
regions = np.random.choice(
    ['North', 'South', 'East', 'West'],
    size=N, p=[0.40, 0.25, 0.20, 0.15]
)
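
The effect of these distributions can be checked numerically. The sketch below reuses the parameters above with an assumed N of 10,000 and compares mean against median: in a right-skewed distribution the mean sits above the median, which is the shape typically seen in prices and durations.

```python
import numpy as np

np.random.seed(42)
N = 10_000  # assumed sample size for the check

prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N)
durations = np.random.exponential(scale=24, size=N)
regions = np.random.choice(
    ["North", "South", "East", "West"], size=N, p=[0.40, 0.25, 0.20, 0.15]
)

# Right-skewed data: a long tail of large values pulls the mean above the median
price_mean, price_median = float(prices.mean()), float(np.median(prices))
duration_mean, duration_median = float(durations.mean()), float(np.median(durations))

# Observed share per region should be close to the configured weights
labels, counts = np.unique(regions, return_counts=True)
region_share = dict(zip(labels.tolist(), (counts / N).tolist()))

print(f"price    mean={price_mean:.1f} median={price_median:.1f}")
print(f"duration mean={duration_mean:.1f} median={duration_median:.1f}")
print("region share:", {k: round(v, 3) for k, v in region_share.items()})
```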

🔴 MANDATORY: Dynamic Date Range (Last 6 Months)

python

from datetime import datetime, timedelta

END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
START_DATE = END_DATE - timedelta(days=180)

Why: Ensures data feels current for demos and dashboards, with enough history for trend analysis.
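
From these two constants a per-day "date spine" can be derived, which is a convenient base for applying daily volume weights later. This is a minimal sketch; build_date_spine is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def build_date_spine(end_date: datetime, days: int = 180) -> list:
    """Return one midnight timestamp per day in [end_date - days, end_date)."""
    start_date = end_date - timedelta(days=days)
    return [start_date + timedelta(days=i) for i in range(days)]


END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0)
START_DATE = END_DATE - timedelta(days=180)
date_spine = build_date_spine(END_DATE)

print(date_spine[0].date(), "to", date_spine[-1].date(), f"({len(date_spine)} days)")
```

Because END_DATE depends on the run date, the dates themselves shift from day to day even with fixed seeds; only the values drawn for each row stay reproducible.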

🔴 MANDATORY: Row Coherence

Attributes within a row MUST correlate logically:

python

# ✅ CORRECT - tier drives amount, priority, and behavior
if tier == 'Enterprise':
    amount = np.random.lognormal(7, 0.8)        # Higher amounts
    priority = np.random.choice(['Critical', 'High', 'Medium'], p=[0.3, 0.5, 0.2])
else:
    amount = np.random.lognormal(3.5, 0.6)      # Lower amounts
    priority = np.random.choice(['High', 'Medium', 'Low'], p=[0.2, 0.5, 0.3])

# ❌ WRONG - independent random values (no correlation)
amount = random.uniform(10, 10000)  # Amount unrelated to tier
priority = random.choice(['Critical', 'High', 'Medium', 'Low'])  # Random priority
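
To check that coherence actually shows up in the output, the tier-driven logic can be wrapped in a function and compared across tiers. The sketch below assumes a hypothetical generate_ticket helper and reuses the distribution parameters from the correct example.

```python
import numpy as np

np.random.seed(42)


def generate_ticket(tier: str) -> dict:
    """Generate one record whose amount and priority depend on the tier."""
    if tier == "Enterprise":
        amount = np.random.lognormal(7, 0.8)
        priority = np.random.choice(["Critical", "High", "Medium"], p=[0.3, 0.5, 0.2])
    else:
        amount = np.random.lognormal(3.5, 0.6)
        priority = np.random.choice(["High", "Medium", "Low"], p=[0.2, 0.5, 0.3])
    return {"tier": tier, "amount": round(float(amount), 2), "priority": str(priority)}


enterprise = [generate_ticket("Enterprise") for _ in range(1000)]
free = [generate_ticket("Free") for _ in range(1000)]

enterprise_median = float(np.median([r["amount"] for r in enterprise]))
free_median = float(np.median([r["amount"] for r in free]))

print(f"median amount: Enterprise={enterprise_median:.2f}, Free={free_median:.2f}")
```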

🔴 MANDATORY: Raw Data Only (No Pre-Aggregated Fields)

Generate one row per event/transaction. NEVER add aggregated columns:

python

# ❌ WRONG - pre-aggregated fields (aggregation belongs in Gold layer)
{"customer_id": cid, "total_orders": 47, "total_revenue": 12500.00, "avg_order_value": 265.95}

# ✅ CORRECT - one row per transaction
{"order_id": "ORD-000001", "customer_id": cid, "amount": 150.00, "order_date": "2025-10-15"}

Why: The Medallion pipeline (Silver DLT → Gold MERGE) computes aggregations downstream.

🔴 MANDATORY: Weighted Sampling for Facts

Dimension characteristics MUST drive fact generation volume and behavior:

python

# Build weighted lookup from dimensions
tier_weights = customers_pdf["tier"].map({'Enterprise': 5.0, 'Pro': 2.0, 'Free': 1.0})
customer_weights = (tier_weights / tier_weights.sum()).tolist()
customer_ids = customers_pdf["customer_id"].tolist()

# Enterprise customers generate 5x more events than Free
cid = np.random.choice(customer_ids, p=customer_weights)
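
The same weighting can be verified without pandas. This sketch builds a small hypothetical customer dimension (100 customers per tier), draws all events in one vectorized call, and counts events per tier; with equal tier sizes the Enterprise-to-Free ratio should land near 5.

```python
import numpy as np

np.random.seed(42)

# Hypothetical dimension: 100 customers in each tier
customers = [
    {"customer_id": f"CUST-{i:04d}", "tier": tier}
    for i, tier in enumerate(["Enterprise", "Pro", "Free"] * 100, start=1)
]
TIER_WEIGHT = {"Enterprise": 5.0, "Pro": 2.0, "Free": 1.0}

raw_weights = np.array([TIER_WEIGHT[c["tier"]] for c in customers])
customer_weights = raw_weights / raw_weights.sum()  # probabilities must sum to 1
customer_ids = [c["customer_id"] for c in customers]

# One vectorized draw is much cheaper than calling np.random.choice per row
N_EVENTS = 20_000
sampled_ids = np.random.choice(customer_ids, size=N_EVENTS, p=customer_weights)

tier_by_id = {c["customer_id"]: c["tier"] for c in customers}
events_per_tier = {tier: 0 for tier in TIER_WEIGHT}
for cid in sampled_ids:
    events_per_tier[tier_by_id[str(cid)]] += 1

print(events_per_tier)
```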

Corruption Pattern Structure

python

# Determine if this record should be corrupted for DQ testing
should_corrupt = random.random() < corruption_rate

if should_corrupt:
    # Apply various DQ violations to test expectations
    corruption_type = random.choice([
        'corruption_type_1',
        'corruption_type_2',
        'corruption_type_3',
    ])
    
    if corruption_type == 'corruption_type_1':
        # Will fail: <expectation_name>
        field = invalid_value  # Description of violation

Comments Must Include

•Corruption type name: Descriptive identifier
•DQ expectation failed: Which expectation(s) this triggers
•Violation description: What makes the data invalid
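
A concrete instance of the pattern might look like the sketch below. The corruption type names, the expectation names in the comments, and the corrupt_record helper are all hypothetical; substitute the expectations defined in your Silver layer.

```python
import random

random.seed(42)

# Hypothetical corruption types; each maps to one hypothetical DLT expectation
CORRUPTION_TYPES = ["null_customer_id", "negative_amount", "future_order_date"]


def corrupt_record(record: dict) -> tuple:
    """Return a corrupted copy of a valid record and the corruption type applied."""
    corrupted = dict(record)  # copy so the valid record is left untouched
    corruption_type = random.choice(CORRUPTION_TYPES)

    if corruption_type == "null_customer_id":
        # Will fail: valid_customer_id (required field is null)
        corrupted["customer_id"] = None
    elif corruption_type == "negative_amount":
        # Will fail: non_negative_amount (amount below zero)
        corrupted["amount"] = -abs(corrupted["amount"])
    else:
        # Will fail: order_date_not_in_future (date far beyond today)
        corrupted["order_date"] = "2999-01-01"

    return corrupted, corruption_type


valid = {
    "order_id": "ORD-000001",
    "customer_id": "CUST-0001",
    "amount": 150.00,
    "order_date": "2025-10-15",
}
corrupted, corruption_type = corrupt_record(valid)
changed_fields = [k for k in valid if valid[k] != corrupted[k]]
print(corruption_type, "->", changed_fields)
```

Returning the corruption type alongside the record makes it easy to tally how many violations of each kind were injected.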

Parameter Handling

Function Parameters

python

import sys


def get_parameters():
    """Get parameters from notebook widgets or command line."""
    try:
        # Try Databricks widgets first (notebook mode)
        catalog = dbutils.widgets.get("catalog")
        schema = dbutils.widgets.get("schema")
        num_records = int(dbutils.widgets.get("num_records"))
        corruption_rate = float(dbutils.widgets.get("corruption_rate"))
    except Exception:
        # dbutils is undefined outside Databricks (NameError) or a widget is missing:
        # fall back to command line arguments or defaults
        catalog = "default_catalog"
        schema = "default_schema"
        num_records = 1000
        corruption_rate = 0.05  # 5% corruption by default
        
        for arg in sys.argv[1:]:
            # split("=", 1) keeps values that themselves contain "=" intact
            if arg.startswith("--catalog="):
                catalog = arg.split("=", 1)[1]
            elif arg.startswith("--schema="):
                schema = arg.split("=", 1)[1]
            elif arg.startswith("--num_records="):
                num_records = int(arg.split("=", 1)[1])
            elif arg.startswith("--corruption_rate="):
                corruption_rate = float(arg.split("=", 1)[1])
    
    return catalog, schema, num_records, corruption_rate
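
To unit-test the command-line fallback without a Databricks runtime, the argument parsing can be pulled into a pure function that takes the argument list explicitly. This is a sketch under that assumption; parse_cli_overrides, DEFAULTS, and CASTS are hypothetical names.

```python
DEFAULTS = {
    "catalog": "default_catalog",
    "schema": "default_schema",
    "num_records": 1000,
    "corruption_rate": 0.05,
}
CASTS = {"num_records": int, "corruption_rate": float}  # other keys stay strings


def parse_cli_overrides(argv: list) -> dict:
    """Apply --key=value overrides from argv on top of the defaults."""
    params = dict(DEFAULTS)
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue  # ignore anything that is not --key=value
        key, value = arg[2:].split("=", 1)
        if key in params:
            params[key] = CASTS.get(key, str)(value)
    return params


params = parse_cli_overrides(
    ["--catalog=dev", "--num_records=5000", "--corruption_rate=0.1"]
)
print(params)
```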

Job Configuration (YAML)

yaml

tasks:
  - task_key: generate_data
    environment_key: default
    notebook_task:
      notebook_path: ../src/layer/generate_data.py
      base_parameters:
        catalog: ${var.catalog}
        schema: ${var.schema}
        num_records: "1000"
        corruption_rate: "0.05"  # 5% corruption for DQ testing

Quick Patterns

Corruption Type Categories

•Missing Required Fields - Null or empty required fields
•Invalid Format/Length - Wrong format or below minimum length
•Out of Range Values - Excessive or negative values
•Business Logic Violations - Field relationships that violate rules
•Temporal Issues - Dates too old or in the future
•Referential Integrity Issues - Missing or invalid foreign keys

Dimension vs Fact Patterns

Dimensions: Generate these first, because facts reference them. Use locale-specific Faker for realistic data.

Facts: Generate these only after the dimensions exist, and load the dimension keys so that every foreign key resolves.
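
The ordering can be sketched as follows: build the dimension, collect its keys into a dimension_keys dictionary, generate facts that draw from those keys, then check for orphans. The helper names and the "customer_ids" key are hypothetical, and the uniform random.choice is used only for brevity; real generation would use weighted sampling.

```python
import random

random.seed(42)


def generate_customer_dim(n: int) -> list:
    """Generate the dimension first; its keys feed fact generation."""
    return [{"customer_id": f"CUST-{i:05d}"} for i in range(1, n + 1)]


def generate_order_facts(dimension_keys: dict, n: int) -> list:
    """Generate facts whose foreign keys are drawn only from existing dimension keys."""
    return [
        {
            "order_id": f"ORD-{i:06d}",
            "customer_id": random.choice(dimension_keys["customer_ids"]),
        }
        for i in range(1, n + 1)
    ]


customers = generate_customer_dim(50)
dimension_keys = {"customer_ids": [c["customer_id"] for c in customers]}
orders = generate_order_facts(dimension_keys, 500)

# Referential integrity check: every fact must point at a known dimension key
known_ids = set(dimension_keys["customer_ids"])
orphans = [o for o in orders if o["customer_id"] not in known_ids]
print(f"{len(orders)} orders, {len(orphans)} orphaned foreign keys")
```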

Data Volume Guidance

Generate enough records so patterns survive downstream aggregation (daily/weekly/regional GROUP BY):

Grain	Minimum Records	Rationale
Daily time series	50-100/day	Trends visible after weekly rollup
Per category	500+ per category	Statistical significance in charts
Per customer	5-20 events/customer	Customer-level analysis works
Total rows	10K-50K minimum	Patterns survive GROUP BY

python

# Example: 180 days of data
N_CUSTOMERS = 2500      # Dimension
N_ORDERS = 25000        # ~10 orders/customer, ~139/day
N_TICKETS = 8000        # ~44/day, enough for weekly trends
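
The per-customer and per-day figures in the comments follow from simple division, which is worth recomputing whenever the constants change. A small sketch, with volume_summary as a hypothetical helper:

```python
DAYS = 180
N_CUSTOMERS = 2500
N_ORDERS = 25000
N_TICKETS = 8000


def volume_summary(n_rows: int, n_customers: int, days: int) -> dict:
    """Average rows per customer and per day for a fact table."""
    return {"per_customer": n_rows / n_customers, "per_day": n_rows / days}


orders_volume = volume_summary(N_ORDERS, N_CUSTOMERS, DAYS)
tickets_volume = volume_summary(N_TICKETS, N_CUSTOMERS, DAYS)

print(f"orders:  {orders_volume['per_customer']:.1f}/customer, {orders_volume['per_day']:.0f}/day")
print(f"tickets: {tickets_volume['per_customer']:.1f}/customer, {tickets_volume['per_day']:.0f}/day")
```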

Common Mistakes to Avoid

❌ DON'T: Use uniform distributions

python

# BAD - everything equally likely (unrealistic)
prices = [random.uniform(10, 1000) for _ in range(N)]
regions = [random.choice(['N', 'S', 'E', 'W']) for _ in range(N)]

✅ DO: Use realistic distributions

python

# GOOD - log-normal for values, weighted for categories
prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N)
regions = np.random.choice(['N', 'S', 'E', 'W'], size=N, p=[0.4, 0.25, 0.2, 0.15])

❌ DON'T: Generate flat temporal data

python

# BAD - ignores weekends, holidays, seasonality
dates = [fake.date_between(start_date='-180d', end_date='today') for _ in range(N)]  # '-6m' would mean 6 minutes in Faker

✅ DO: Add temporal patterns

python

# GOOD - weekday/weekend/holiday/spike effects
def get_daily_multiplier(date, us_holidays):
    mult = 1.0
    if date.weekday() >= 5:
        mult *= 0.6                                   # Weekend drop
    if date in us_holidays:
        mult *= 0.3                                   # Holiday drop
    mult *= 1 + 0.15 * (date.month - 6) / 6           # Linear ramp over the year, highest in Q4
    return max(0.1, mult * np.random.normal(1, 0.1))  # ~10% noise, floored at 0.1
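
One way to turn daily multipliers into row counts is to normalize them into probabilities and split the total volume with a multinomial draw. The sketch below is an assumption about how the multiplier might be consumed, not the skill's own implementation; the fixed start date and the two-entry HOLIDAYS set are hypothetical stand-ins (the holidays package could supply real dates).

```python
from datetime import date, timedelta

import numpy as np

np.random.seed(42)

# Hypothetical holiday set for the window below
HOLIDAYS = {date(2025, 7, 4), date(2025, 9, 1)}


def get_daily_multiplier(day: date, holiday_dates: set) -> float:
    """Relative volume for a day: weekend, holiday, and seasonal effects plus noise."""
    mult = 1.0
    if day.weekday() >= 5:
        mult *= 0.6
    if day in holiday_dates:
        mult *= 0.3
    mult *= 1 + 0.15 * (day.month - 6) / 6
    return max(0.1, mult * np.random.normal(1, 0.1))


start = date(2025, 5, 1)  # fixed here so the example is repeatable
days = [start + timedelta(days=i) for i in range(180)]
weights = np.array([get_daily_multiplier(d, HOLIDAYS) for d in days])

# Split the total volume across days in proportion to the multipliers
N_ORDERS = 25_000
daily_counts = np.random.multinomial(N_ORDERS, weights / weights.sum())

is_weekend = np.array([d.weekday() >= 5 for d in days])
weekend_mean = float(daily_counts[is_weekend].mean())
weekday_mean = float(daily_counts[~is_weekend].mean())
print(f"avg orders/day: weekday={weekday_mean:.0f}, weekend={weekend_mean:.0f}")
```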

❌ DON'T: Add pre-aggregated fields

python

# BAD - aggregation belongs in Gold layer
{"customer_id": cid, "total_orders": 47, "avg_csat": 4.2}

✅ DO: Generate raw transactional records

python

# GOOD - one row per event
{"order_id": "ORD-001", "customer_id": cid, "amount": 150.00}

❌ DON'T: Apply corruption before generating valid data

python

# BAD - hard to maintain
if should_corrupt:
    field = generate_invalid_field()
else:
    field = generate_valid_field()

✅ DO: Generate valid data first, then corrupt

python

# GOOD - clean separation
field = generate_valid_field()

if should_corrupt:
    field = corrupt_field(field)  # Modify valid data

❌ DON'T: Hardcode corruption without comments

python

# BAD - no DQ mapping
if corruption_type == 'bad_data':
    field = None

✅ DO: Document which expectation fails

python

# GOOD - clear DQ mapping
if corruption_type == 'null_required_field':
    # Will fail: valid_field_name
    field = None

❌ DON'T: Use magic numbers

python

# BAD - unclear threshold
if random.random() < 0.05:  # What is 0.05?
    field = None

✅ DO: Use named parameter

python

# GOOD - explicit parameter
should_corrupt = random.random() < corruption_rate

Testing Scenarios

Development: High Corruption

yaml

corruption_rate: "0.10"  # 10% for thorough testing

Staging: Realistic Corruption

yaml

corruption_rate: "0.05"  # 5% production-like

Production: No Synthetic Corruption

yaml

corruption_rate: "0.0"  # Real data only

Validation Checklist

Realism (CRITICAL)

• np.random.seed(SEED) AND Faker.seed(SEED) called at script top, plus random.seed(SEED) if the stdlib random module is used
• Monetary values use log-normal distribution (NOT uniform)
• Duration values use exponential distribution (NOT uniform)
• Categorical values use weighted probabilities (NOT equal)
• Row coherence: tier→amount, priority→resolution_time→CSAT correlations exist
• Time patterns: weekday/weekend/holiday/seasonality multipliers applied
• Dynamic date range: last 6 months from datetime.now()
• No pre-aggregated fields (total_x, sum_x, avg_x)
• Data volume: 10K-50K rows minimum, 50-100/day for time series

Corruption

• corruption_rate parameter added with default 0.05 (5%)
• Each corruption type has comment: # Will fail: <expectation_name>
• Corruption types map 1:1 to DLT expectations
• Valid data generated FIRST, then corrupted

Structure

• Parameter handling uses dbutils.widgets.get() (NOT argparse)
• Job YAML includes corruption_rate parameter
• Dimensions generated BEFORE facts
• Weighted sampling: dimension characteristics drive fact volume
• Referential integrity maintained (facts reference valid dimension keys)
• Validation prints at end (distribution checks, corruption stats)
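
The final checklist item, validation prints at the end, can be as simple as comparing the observed corruption rate with the configured one. A minimal sketch with a single hypothetical corruption type and expectation name; the counter lives outside the rows so the generated records stay raw.

```python
import random
from collections import Counter

random.seed(42)

CORRUPTION_RATE = 0.05
N_RECORDS = 10_000

records = []
corruption_counts = Counter()  # tracked separately; no bookkeeping columns in the data

for i in range(N_RECORDS):
    record = {"order_id": f"ORD-{i:06d}", "amount": 150.00}  # valid data first
    if random.random() < CORRUPTION_RATE:
        # Will fail: valid_amount (hypothetical expectation: amount IS NOT NULL)
        record["amount"] = None
        corruption_counts["null_amount"] += 1
    records.append(record)

observed_rate = sum(corruption_counts.values()) / N_RECORDS
null_amounts = sum(1 for r in records if r["amount"] is None)

print(f"Configured corruption rate: {CORRUPTION_RATE:.1%}")
print(f"Observed corruption rate:   {observed_rate:.1%}")
print(f"By type: {dict(corruption_counts)}")
```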

Required Libraries

yaml

# ✅ ALWAYS include these in environment dependencies
dependencies:
  - "Faker==22.0.0"
  - "holidays>=0.40"
  - "numpy>=1.24.0"
  - "pandas>=2.0.0"

Generate the data with pandas, where row-by-row logic is simpler to write and fast at these volumes, then convert the result to a Spark DataFrame to save it to Delta tables.

Reference Files

•Faker Providers - Detailed provider examples, corruption patterns, non-linear distribution patterns, time-based pattern functions, row coherence patterns, data volume guidance, and complete implementation examples. Includes locale-specific providers, business-specific providers, and domain-specific constants.
•Generate Data Script - Data generation utility with standard function signatures, numpy-based distributions, weighted sampling, temporal patterns, seeding, and parameter handling. Includes generate_dimension_data(), generate_fact_data(), apply_corruption(), get_daily_multiplier(), and get_parameters() functions.

References

•Faker Documentation
•DLT Expectations