Relational Mapper Skill

Description

Multi-table relationship detection and synthetic relational data generation. Handles foreign keys, referential integrity, and complex database schemas.

Purpose

This custom skill enhances the synthetic data generator with:

•Multi-table schema detection
•Foreign key relationship inference
•Referential integrity maintenance
•Relational database generation
•Complex dependency management

When to Use

Use this skill when:

•Generating multi-table synthetic databases
•Maintaining foreign key relationships
•Creating related datasets
•Building complex database schemas
•Testing relational database applications

Capabilities

1. Relationship Detection

•Primary Keys: Identify primary key columns
•Foreign Keys: Detect foreign key relationships
•One-to-One: 1:1 relationships
•One-to-Many: 1:N relationships
•Many-to-Many: N:M relationships via junction tables
•Self-Referencing: Hierarchical relationships

2. Schema Mapping

•Entity-Relationship Diagrams: Build ER diagrams
•Table Dependencies: Determine generation order
•Cardinality Detection: Identify relationship cardinality
•Cascade Rules: Detect ON DELETE/UPDATE rules
•Composite Keys: Handle multi-column keys

3. Referential Integrity

•Foreign Key Constraints: Maintain valid references
•Cascade Operations: Handle cascading updates/deletes
•Orphan Prevention: Prevent orphaned records
•Circular Dependencies: Handle circular references
•Constraint Validation: Validate all constraints

4. Multi-Table Generation

•Dependency Order: Generate tables in correct order
•Coordinated Generation: Maintain relationships during generation
•Batch Generation: Generate related records in batches
•Transaction Support: Ensure atomic operations
•Scalability: Handle large multi-table schemas

Usage Instructions

Step 1: Analyze Multi-Table Schema

python

# Analyze multiple related pattern files
relational_analysis = await analyze_relational_schema({
    "tables": {
        "customers": "customers_sample.csv",
        "orders": "orders_sample.csv",
        "order_items": "order_items_sample.csv",
        "products": "products_sample.csv"
    },
    "detect_relationships": True  # Activates this skill
})

# Review detected relationships
relationships = relational_analysis["relationships"]
for rel in relationships:
    print(f"{rel['from_table']}.{rel['from_column']} -> "
          f"{rel['to_table']}.{rel['to_column']} "
          f"({rel['relationship_type']})")

Step 2: Define or Refine Relationships

python

# Manually specify or refine relationships
relationship_config = {
    "relationships": [
        {
            "from_table": "orders",
            "from_column": "customer_id",
            "to_table": "customers",
            "to_column": "id",
            "type": "many_to_one",
            "on_delete": "CASCADE"
        },
        {
            "from_table": "order_items",
            "from_column": "order_id",
            "to_table": "orders",
            "to_column": "id",
            "type": "many_to_one",
            "on_delete": "CASCADE"
        },
        {
            "from_table": "order_items",
            "from_column": "product_id",
            "to_table": "products",
            "to_column": "id",
            "type": "many_to_one",
            "on_delete": "RESTRICT"
        }
    ]
}

Step 3: Generate Relational Data

python

# Generate all related tables
generation_result = await generate_relational_data({
    "schema": relational_analysis["schema"],
    "relationships": relationship_config["relationships"],
    "num_records": {
        "customers": 1000,
        "orders": 5000,  # 5 orders per customer on average
        "order_items": 15000,  # 3 items per order on average
        "products": 500
    },
    "maintain_referential_integrity": True
})

Step 4: Export to Database Format

python

# Export as SQL with relationships
export_result = await export_data_tool({
    "format": "sql",
    "output_path": "database_dump.sql",
    "session_id": generation_result["session_id"],
    "options": {
        "include_create_tables": True,
        "include_foreign_keys": True,
        "include_indexes": True
    }
})

Relationship Detection Algorithms

Foreign Key Detection

python

def detect_foreign_key(from_table, from_column, to_table, to_column):
    """
    Detect if from_column references to_column
    """
    # Check 1: Name similarity
    if is_fk_naming_pattern(from_column, to_column):
        confidence = 0.7
    else:
        confidence = 0.3

    # Check 2: Value overlap
    from_values = set(from_table[from_column].unique())
    to_values = set(to_table[to_column].unique())
    overlap = len(from_values & to_values) / len(from_values)

    if overlap > 0.9:
        confidence += 0.3
    elif overlap > 0.7:
        confidence += 0.2

    # Check 3: Data type compatibility
    if from_table[from_column].dtype == to_table[to_column].dtype:
        confidence += 0.1

    # Check 4: Cardinality
    if is_many_to_one_cardinality(from_table, from_column, to_table, to_column):
        confidence += 0.1

    return {
        "is_foreign_key": confidence > 0.8,
        "confidence": confidence,
        "overlap_ratio": overlap
    }

Cardinality Detection

python

def detect_cardinality(table1, col1, table2, col2):
    """
    Determine relationship cardinality
    """
    # Count unique values
    unique_in_table1 = table1[col1].nunique()
    unique_in_table2 = table2[col2].nunique()

    # Count duplicates
    dup_rate_table1 = 1 - (unique_in_table1 / len(table1))
    dup_rate_table2 = 1 - (unique_in_table2 / len(table2))

    if dup_rate_table1 < 0.01 and dup_rate_table2 < 0.01:
        return "one_to_one"
    elif dup_rate_table1 > 0.5 and dup_rate_table2 < 0.01:
        return "many_to_one"
    elif dup_rate_table1 < 0.01 and dup_rate_table2 > 0.5:
        return "one_to_many"
    else:
        return "many_to_many"

Dependency Order Calculation

python

def calculate_generation_order(tables, relationships):
    """
    Topologically sort tables based on foreign key dependencies
    """
    # Build dependency graph
    graph = {}
    for table in tables:
        graph[table] = []

    for rel in relationships:
        # from_table depends on to_table
        graph[rel["from_table"]].append(rel["to_table"])

    # Topological sort (Kahn's algorithm)
    order = []
    in_degree = {table: 0 for table in tables}

    for table in graph:
        for dep in graph[table]:
            in_degree[dep] += 1

    queue = [t for t in tables if in_degree[t] == 0]

    while queue:
        table = queue.pop(0)
        order.append(table)

        for dep in graph[table]:
            in_degree[dep] -= 1
            if in_degree[dep] == 0:
                queue.append(dep)

    if len(order) != len(tables):
        raise CircularDependencyError("Circular dependencies detected")

    return order

Relational Schema Output

Detected Schema

json

{
  "tables": {
    "customers": {
      "primary_key": "id",
      "columns": {
        "id": {"type": "integer", "nullable": false, "unique": true},
        "email": {"type": "string", "nullable": false, "unique": true},
        "name": {"type": "string", "nullable": false},
        "created_at": {"type": "timestamp", "nullable": false}
      }
    },
    "orders": {
      "primary_key": "id",
      "columns": {
        "id": {"type": "integer", "nullable": false, "unique": true},
        "customer_id": {"type": "integer", "nullable": false},
        "order_date": {"type": "date", "nullable": false},
        "total_amount": {"type": "decimal", "nullable": false}
      }
    },
    "order_items": {
      "primary_key": "id",
      "columns": {
        "id": {"type": "integer", "nullable": false, "unique": true},
        "order_id": {"type": "integer", "nullable": false},
        "product_id": {"type": "integer", "nullable": false},
        "quantity": {"type": "integer", "nullable": false},
        "price": {"type": "decimal", "nullable": false}
      }
    }
  },
  "relationships": [
    {
      "name": "fk_orders_customer",
      "from_table": "orders",
      "from_column": "customer_id",
      "to_table": "customers",
      "to_column": "id",
      "type": "many_to_one",
      "cardinality": "N:1",
      "on_delete": "CASCADE",
      "on_update": "CASCADE"
    },
    {
      "name": "fk_order_items_order",
      "from_table": "order_items",
      "from_column": "order_id",
      "to_table": "orders",
      "to_column": "id",
      "type": "many_to_one",
      "cardinality": "N:1",
      "on_delete": "CASCADE",
      "on_update": "CASCADE"
    }
  ],
  "generation_order": ["customers", "orders", "order_items"],
  "dependency_graph": {
    "nodes": ["customers", "orders", "order_items"],
    "edges": [
      {"from": "orders", "to": "customers"},
      {"from": "order_items", "to": "orders"}
    ]
  }
}

Generation Strategy

Dependency-Aware Generation

python

# Generate tables in dependency order
for table_name in generation_order:
    # Get parent table data if needed
    parent_data = {}
    for rel in relationships:
        if rel["from_table"] == table_name:
            parent_table = rel["to_table"]
            parent_col = rel["to_column"]
            parent_data[rel["from_column"]] = generated_data[parent_table][parent_col]

    # Generate table with foreign key constraints
    generated_data[table_name] = generate_table(
        schema=schema[table_name],
        num_rows=num_records[table_name],
        foreign_keys=parent_data,
        maintain_integrity=True
    )

Cardinality-Aware Generation

python

def generate_related_records(parent_records, cardinality_config):
    """
    Generate child records based on cardinality
    """
    child_records = []

    for parent_id in parent_records:
        # Determine number of child records
        if cardinality_config["type"] == "one_to_one":
            num_children = 1
        elif cardinality_config["type"] == "one_to_many":
            # Use Poisson distribution for realistic counts
            avg_children = cardinality_config.get("avg_children", 3)
            num_children = np.random.poisson(avg_children)
        else:  # many_to_many
            num_children = np.random.randint(
                cardinality_config.get("min_children", 1),
                cardinality_config.get("max_children", 10)
            )

        # Generate child records
        for _ in range(num_children):
            child_record = generate_record(
                schema=child_schema,
                foreign_key_value=parent_id
            )
            child_records.append(child_record)

    return child_records

Complex Scenarios

Many-to-Many Relationships

python

# Example: Students and Courses (many-to-many via enrollments)
schema = {
    "students": {...},
    "courses": {...},
    "enrollments": {  # Junction table
        "student_id": "FK -> students.id",
        "course_id": "FK -> courses.id",
        "enrollment_date": "date",
        "grade": "string"
    }
}

# Generate junction table with both foreign keys
generate_junction_table(
    table1="students",
    table2="courses",
    junction_table="enrollments",
    avg_links_per_table1_record=5,  # Avg 5 courses per student
    avg_links_per_table2_record=30  # Avg 30 students per course
)

Self-Referencing Relationships

python

# Example: Employees with manager_id referencing same table
schema = {
    "employees": {
        "id": "integer PRIMARY KEY",
        "name": "string",
        "manager_id": "integer FK -> employees.id",  # Self-reference
        "department": "string"
    }
}

# Generate with hierarchical structure
generate_hierarchical_table(
    schema=schema["employees"],
    num_records=1000,
    levels=4,  # Organization levels
    branching_factor=5  # Avg employees per manager
)

Circular Dependencies

python

# Example: Tables with circular foreign keys
# Table A -> Table B -> Table C -> Table A

# Strategy 1: Generate in stages
1. Generate Table A without FK to Table C
2. Generate Table B with FK to Table A
3. Generate Table C with FK to Table B
4. Update Table A with FK to Table C

# Strategy 2: Defer constraints
1. Generate all tables without FK constraints
2. Establish relationships after generation
3. Add FK constraints last

SQL Export with Relationships

CREATE TABLE Statements

sql

CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE,
    name VARCHAR(100) NOT NULL,
    created_at TIMESTAMP NOT NULL
);

CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date DATE NOT NULL,
    total_amount DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id)
        ON DELETE CASCADE
        ON UPDATE CASCADE
);

CREATE TABLE order_items (
    id INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL,
    product_id INTEGER NOT NULL,
    quantity INTEGER NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(id)
        ON DELETE CASCADE,
    FOREIGN KEY (product_id) REFERENCES products(id)
        ON DELETE RESTRICT
);

CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
CREATE INDEX idx_order_items_product_id ON order_items(product_id);

Configuration

yaml

relational_mapper:
  relationship_detection:
    enabled: true
    confidence_threshold: 0.8
    check_naming_patterns: true
    check_value_overlap: true
  generation:
    maintain_referential_integrity: true
    use_transactions: true
    batch_size: 1000
  cardinality:
    one_to_many_avg: 3
    many_to_many_avg: 5
  export:
    include_foreign_keys: true
    include_indexes: true
    include_constraints: true

Best Practices

•Analyze First: Analyze all tables before generation
•Verify Relationships: Review detected relationships
•Test Small: Generate small datasets first
•Check Integrity: Validate referential integrity
•Use Transactions: Ensure atomic operations
•Index Foreign Keys: Add indexes for performance
•Document Schema: Maintain ER diagrams

Performance Optimization

•Batch Generation: Generate in batches for large datasets
•Parallel Tables: Generate independent tables in parallel
•Index Creation: Defer index creation until after inserts
•Constraint Checking: Defer constraint validation
•Memory Management: Stream large tables to disk

Example Use Cases

E-commerce Database

python

tables = {
    "customers": 10000,
    "products": 1000,
    "orders": 50000,
    "order_items": 150000,
    "reviews": 75000,
    "categories": 50,
    "product_categories": 2000  # many-to-many
}

Social Network

python

tables = {
    "users": 100000,
    "posts": 500000,
    "comments": 2000000,
    "likes": 5000000,
    "friendships": 200000  # many-to-many self-reference
}

Healthcare System

python

tables = {
    "patients": 50000,
    "doctors": 500,
    "appointments": 200000,
    "prescriptions": 300000,
    "medications": 1000,
    "diagnoses": 5000
}

Error Handling

•Circular Dependencies: Detect and resolve
•Missing References: Create parent records first
•Constraint Violations: Validate before insert
•Orphaned Records: Prevent with CASCADE rules
•Type Mismatches: Ensure compatible FK types

Support

For relational mapping issues:

•Review detected relationships
•Verify generation order
•Check referential integrity
•Validate cardinality settings
•Test with small datasets first