Schema Inference Skill

Description

Automatic schema detection and field type inference from data patterns. The skill analyzes sample data to infer field types, constraints, and relationships between fields.

Purpose

This custom skill enhances the synthetic data generator with:

•Automatic schema detection from pattern files
•Intelligent field type inference
•Constraint discovery and inference
•Relationship detection between fields
•Semantic meaning extraction

When to Use

Use this skill when:

•Analyzing pattern files without explicit schemas
•Inferring data types and constraints automatically
•Detecting field relationships and dependencies
•Understanding semantic meanings of fields
•Building schemas from sample data

Capabilities

1. Type Inference

•Primitive Types: int, float, string, boolean, date, timestamp
•Complex Types: arrays, objects, JSON, XML
•Semantic Types: email, phone, URL, IP address, UUID, SSN
•Geographic Types: address, city, state, country, coordinates, postal code
•Business Types: currency, percentage, SKU, product code

2. Constraint Inference

•Nullability: Required vs optional fields
•Uniqueness: Unique keys and identifiers
•Ranges: Min/max values for numeric fields
•Length: Min/max length for strings
•Patterns: Regex patterns for formatted strings
•Enumerations: Categorical field values

3. Relationship Inference

•Foreign Keys: References between tables/datasets
•Parent-Child: Hierarchical relationships
•One-to-Many: Field cardinality
•Many-to-Many: Complex relationships
•Functional Dependencies: Field dependencies

4. Semantic Analysis

•Field Names: Infer meaning from names
•Value Patterns: Detect semantic patterns
•Domain Knowledge: Apply domain-specific rules
•Context Clues: Use surrounding fields for context
•Business Logic: Infer business rules from data

Usage Instructions

Step 1: Automatic Schema Inference

python

# Automatically runs during pattern analysis
pattern_analysis = await deep_analyze_pattern_tool({
    "file_path": "unknown_schema.csv",
    "analysis_depth": "comprehensive",
    "infer_schema": True  # Activates this skill
})

# Access inferred schema
schema = pattern_analysis["schema"]

Step 2: Review Inferred Schema

python

for field_name, field_info in schema.items():
    print(f"Field: {field_name}")
    print(f"  Type: {field_info['type']}")
    print(f"  Semantic Type: {field_info['semantic_type']}")
    print(f"  Nullable: {field_info['nullable']}")
    print(f"  Unique: {field_info['unique']}")
    if field_info.get('constraints'):
        print(f"  Constraints: {field_info['constraints']}")

Step 3: Use Schema for Generation

python

# Inferred schema automatically used for generation
generation_result = await generate_with_modes_tool({
    "requirements": {"schema": schema},  # Uses inferred schema
    "num_rows": 10000,
    "mode": "balanced"
})

Inference Algorithms

Type Inference Algorithm

python

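def infer_field_type(values):
    """
    Multi-stage type inference algorithm.

    The helper predicates (all_match_pattern, can_convert_to_int,
    looks_like_id, unique_ratio, and so on) and the *_REGEX constants
    are assumed to be defined elsewhere in the skill.
    """
    # Nulls carry no type information, so drop them before testing
    values = [v for v in values if v is not None]
    if not values:
        return "string"  # Nothing to infer from; use the Stage 4 default

    # Stage 1: Check for special types
    if all_match_pattern(values, EMAIL_REGEX):
        return "email"
    if all_match_pattern(values, PHONE_REGEX):
        return "phone"
    if all_match_pattern(values, UUID_REGEX):
        return "uuid"

    # Stage 2: Try primitive type conversion
    if can_convert_to_int(values):
        if looks_like_id(values):
            return "identifier"
        return "integer"
    if can_convert_to_float(values):
        if looks_like_currency(values):
            return "currency"
        return "float"
    if can_convert_to_date(values):
        return "date" if no_time_component(values) else "timestamp"
    if can_convert_to_bool(values):
        return "boolean"

    # Stage 3: Check for categorical (fewer than half the values are distinct)
    if unique_ratio(values) < 0.5:
        return "categorical"

    # Stage 4: Default to string
    return "string"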
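
The helper predicates called by infer_field_type are not defined in this document. The following is a minimal sketch of three of them, assuming values arrive as a list of strings with nulls already removed. The implementations are illustrative guesses, not the skill's actual code.

```python
import re

# Same email pattern as listed under Semantic Type Patterns
EMAIL_REGEX = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def all_match_pattern(values, pattern):
    """True if there is at least one value and every value fully matches."""
    return bool(values) and all(pattern.fullmatch(str(v)) for v in values)


def can_convert_to_int(values):
    """True if there is at least one value and every value parses as an integer."""
    try:
        for v in values:
            int(str(v))
    except ValueError:
        return False
    return bool(values)


def unique_ratio(values):
    """Distinct values divided by total values (0.0 for an empty list)."""
    return len(set(values)) / len(values) if values else 0.0
```

Requiring a non-empty list keeps an empty column from being classified as every type at once.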

Constraint Inference Algorithm

python

def infer_constraints(field_name, values, inferred_type):
    """
    Infer constraints from data patterns.

    The helpers has_null_values, unique_ratio, detect_common_pattern, and
    is_categorical are assumed to be defined elsewhere in the skill.
    """
    constraints = {}

    # Exclude nulls from the statistics below so that min(), max(), and
    # len() never receive None
    non_null = [v for v in values if v is not None]

    # Nullability
    constraints["nullable"] = has_null_values(values)

    # Uniqueness
    if unique_ratio(non_null) > 0.99:
        constraints["unique"] = True

    # Range constraints
    if non_null and inferred_type in ["integer", "float"]:
        constraints["min"] = min(non_null)
        constraints["max"] = max(non_null)

    # Length constraints
    if non_null and inferred_type == "string":
        constraints["min_length"] = min(len(v) for v in non_null)
        constraints["max_length"] = max(len(v) for v in non_null)

    # Pattern constraints
    if common_pattern := detect_common_pattern(non_null):
        constraints["pattern"] = common_pattern

    # Enum constraints (sorted so the output is deterministic)
    if is_categorical(inferred_type, non_null):
        constraints["enum"] = sorted(set(non_null), key=str)

    return constraints
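
detect_common_pattern is called by infer_constraints but not defined in this document. One possible sketch, assuming a "common pattern" means that every value has the same character-class shape (digit, letter, or literal at each position), is shown below. The approach and the value_shape helper are hypothetical.

```python
import re


def value_shape(value):
    """Map each character to a regex token: digit, letter, or escaped literal."""
    tokens = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            tokens.append(r"\d")
        elif ch.isascii() and ch.isalpha():
            tokens.append("[A-Za-z]")
        else:
            tokens.append(re.escape(ch))
    return "".join(tokens)


def detect_common_pattern(values):
    """Return an anchored regex if all non-null values share one shape, else None."""
    shapes = {value_shape(str(v)) for v in values if v is not None}
    if len(shapes) != 1:
        return None
    return "^" + shapes.pop() + "$"
```

This only finds fixed-length formats such as SKU codes; variable-length formats would need a more tolerant shape comparison.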

Relationship Inference Algorithm

python

def infer_relationships(schema, dataframe):
    """
    Detect relationships between fields
    """
    relationships = []

    for field1 in schema:
        for field2 in schema:
            if field1 == field2:
                continue

            # Check for foreign key relationship
            if is_foreign_key(dataframe[field1], dataframe[field2]):
                relationships.append({
                    "type": "foreign_key",
                    "from": field1,
                    "to": field2
                })

            # Check for functional dependency
            if is_functionally_dependent(dataframe[field1], dataframe[field2]):
                relationships.append({
                    "type": "functional_dependency",
                    "determinant": field1,
                    "dependent": field2
                })

    return relationships
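
The two predicates used by infer_relationships are not defined in this document. The sketch below shows one way they could work on plain Python sequences. The 0.98 default threshold and the dependency_strength helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def dependency_strength(determinant, dependent):
    """Fraction of rows whose dependent value is the most common one
    for that row's determinant value."""
    groups = defaultdict(Counter)
    for d, v in zip(determinant, dependent):
        groups[d][v] += 1
    total = sum(sum(c.values()) for c in groups.values())
    if total == 0:
        return 0.0
    consistent = sum(max(c.values()) for c in groups.values())
    return consistent / total


def is_functionally_dependent(determinant, dependent, threshold=0.98):
    """True if the determinant predicts the dependent in enough rows."""
    return dependency_strength(determinant, dependent) >= threshold


def is_foreign_key(child, parent):
    """True if parent values are unique and contain every non-null child value."""
    parent_values = [p for p in parent if p is not None]
    if not parent_values or len(set(parent_values)) != len(parent_values):
        return False
    child_values = {c for c in child if c is not None}
    return bool(child_values) and child_values <= set(parent_values)
```

A unique determinant column trivially scores 1.0 against every other field, so identifier columns would likely need to be excluded before reporting functional dependencies.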

Semantic Type Patterns

Email Detection

regex

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Phone Number Detection

regex

^\+?1?\d{9,15}$

URL Detection

regex

^https?://[^\s/$.?#].[^\s]*$

UUID Detection

regex

^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$

IP Address Detection

regex

^(?:(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
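
As an alternative to regex matching, Python's standard ipaddress module checks both the dotted-quad shape and the 0-255 range of each octet. The is_ipv4 wrapper below is a hypothetical helper, not part of the skill.

```python
import ipaddress


def is_ipv4(value):
    """True if value is a string holding a valid dotted-quad IPv4 address."""
    if not isinstance(value, str):
        return False  # IPv4Address also accepts ints, which are not wanted here
    try:
        ipaddress.IPv4Address(value)
    except ValueError:
        return False
    return True
```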

Credit Card Detection

regex

^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13})$
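
A regex can confirm only the issuer prefix and length. Payment card numbers also end in a Luhn check digit, so a check along the following lines could reduce false positives on arbitrary digit strings. The function name is hypothetical and this is a sketch, not the skill's implementation.

```python
def luhn_valid(number):
    """True if a string of ASCII digits passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit
    for position, ch in enumerate(reversed(number)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```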

SSN Detection

regex

^\d{3}-\d{2}-\d{4}$

Currency Detection

python

# Patterns: $1,234.56 or 1234.56 or USD 1234.56
# - May start with a currency symbol or code
# - May contain thousands separators
# - Has 2 decimal places
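
The three example formats can be recognized with one regex and converted to a Decimal. This is a minimal sketch: the symbol set ($, €, £), the three-letter code prefix, and the parse_currency name are assumptions for illustration.

```python
import re
from decimal import Decimal

CURRENCY_RE = re.compile(
    r"(?:[$€£]|[A-Z]{3}\s)?"                # optional symbol, or code plus a space
    r"(?P<whole>\d{1,3}(?:,\d{3})+|\d+)"    # grouped thousands or plain digits
    r"(?P<cents>\.\d{2})?"                  # optional two decimal places
)


def parse_currency(text):
    """Return the amount as a Decimal, or None if text is not currency-shaped."""
    match = CURRENCY_RE.fullmatch(text.strip())
    if match is None:
        return None
    whole = match.group("whole").replace(",", "")
    return Decimal(whole + (match.group("cents") or ""))
```

Decimal avoids the rounding error that float arithmetic would introduce into monetary values.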

Postal Code Detection

python

# US: 12345 or 12345-6789
# UK: SW1A 1AA
# Canada: K1A 0B1
# - Country-specific patterns
# - Alphanumeric or numeric
# - Optional separators
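
Country-specific detection could be sketched as a table of patterns tried in turn. The regexes below are deliberately simplified approximations of the three formats shown and do not cover every valid code; the names are hypothetical.

```python
import re

# Simplified, illustrative patterns; real postal formats have more rules
POSTAL_PATTERNS = {
    "US": re.compile(r"\d{5}(?:-\d{4})?"),
    "UK": re.compile(r"[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}"),
    "CA": re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d"),
}


def detect_postal_countries(value):
    """Return the country keys whose pattern fully matches the value."""
    candidate = value.strip().upper()
    return [
        country
        for country, pattern in POSTAL_PATTERNS.items()
        if pattern.fullmatch(candidate)
    ]
```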

Inferred Schema Output

Example Inferred Schema

json

{
  "customer_id": {
    "type": "integer",
    "semantic_type": "identifier",
    "nullable": false,
    "unique": true,
    "constraints": {
      "min": 1,
      "max": 999999,
      "auto_increment": true
    }
  },
  "email": {
    "type": "string",
    "semantic_type": "email",
    "nullable": false,
    "unique": true,
    "constraints": {
      "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
      "min_length": 5,
      "max_length": 254
    }
  },
  "age": {
    "type": "integer",
    "semantic_type": "age",
    "nullable": true,
    "unique": false,
    "constraints": {
      "min": 18,
      "max": 100
    }
  },
  "country": {
    "type": "string",
    "semantic_type": "country",
    "nullable": false,
    "unique": false,
    "constraints": {
      "enum": ["USA", "Canada", "UK", "Australia"]
    }
  },
  "created_at": {
    "type": "timestamp",
    "semantic_type": "timestamp",
    "nullable": false,
    "unique": false,
    "constraints": {
      "min": "2020-01-01T00:00:00Z",
      "max": "2025-11-01T00:00:00Z",
      "timezone": "UTC"
    }
  }
}
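
A schema in this shape can be used to check individual values before generation. The validator below is a hypothetical sketch that covers only the constraint keys shown in the example (min, max, enum, pattern, min_length, max_length).

```python
import re


def validate_value(value, field_info):
    """Return a list of violation messages for one value (empty if valid)."""
    errors = []
    if value is None:
        if not field_info.get("nullable", True):
            errors.append("null not allowed")
        return errors
    constraints = field_info.get("constraints", {})
    if "min" in constraints and value < constraints["min"]:
        errors.append(f"below min {constraints['min']}")
    if "max" in constraints and value > constraints["max"]:
        errors.append(f"above max {constraints['max']}")
    if "enum" in constraints and value not in constraints["enum"]:
        errors.append("not in enum")
    if "pattern" in constraints and not re.fullmatch(constraints["pattern"], str(value)):
        errors.append("pattern mismatch")
    if isinstance(value, str):
        if len(value) < constraints.get("min_length", 0):
            errors.append("too short")
        if "max_length" in constraints and len(value) > constraints["max_length"]:
            errors.append("too long")
    return errors
```

For example, validating 17 against the age field above would report a value below the minimum of 18.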

Relationship Output

json

{
  "relationships": [
    {
      "type": "foreign_key",
      "from_table": "orders",
      "from_field": "customer_id",
      "to_table": "customers",
      "to_field": "id"
    },
    {
      "type": "functional_dependency",
      "determinant": "postal_code",
      "dependent": ["city", "state"],
      "strength": 0.98
    }
  ]
}

Integration with Generation

Type-Specific Generation

python

# Based on inferred semantic type:
# - email → Use Faker email generator
# - phone → Use country-specific phone format
# - UUID → Use UUID4 generation
# - SSN → Use valid SSN format (SSNs have no checksum; avoid unissued area numbers such as 000, 666, and 900-999)
# - currency → Use appropriate decimal precision
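
The mapping from semantic type to generator can be sketched as a dispatch table. To stay self-contained this example uses only the standard library rather than Faker, and the GENERATORS table, function names, and the 0 to 10,000 currency range are all assumptions.

```python
import random
import uuid


def generate_ssn(rng):
    """SSN-shaped string that avoids never-issued areas (000, 666, 900-999),
    group 00, and serial 0000."""
    area = rng.choice([n for n in range(1, 900) if n != 666])
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


# Hypothetical dispatch table keyed by inferred semantic type
GENERATORS = {
    "uuid": lambda rng: str(uuid.UUID(int=rng.getrandbits(128), version=4)),
    "ssn": generate_ssn,
    "currency": lambda rng: f"{rng.uniform(0, 10_000):.2f}",
}


def generate_value(semantic_type, rng):
    """Generate one value for the given semantic type using a seeded RNG."""
    return GENERATORS[semantic_type](rng)
```

Passing a seeded random.Random instance makes the generated data reproducible across runs.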

Constraint-Aware Generation

python

# Based on inferred constraints:
# - unique → Ensure no duplicates
# - nullable → Generate nulls at observed rate
# - range → Generate within min/max bounds
# - pattern → Generate matching regex pattern
# - enum → Sample from observed values
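
A rough sketch of how these rules might combine for a single column follows. It handles only integer ranges and enums; the generate_column name and the null_rate parameter are hypothetical.

```python
import random


def generate_column(constraints, num_rows, null_rate=0.0, seed=None):
    """Generate values honoring enum, min/max, unique, and a null rate."""
    rng = random.Random(seed)
    if "enum" in constraints:
        values = [rng.choice(constraints["enum"]) for _ in range(num_rows)]
    else:
        low, high = constraints["min"], constraints["max"]
        if constraints.get("unique"):
            if high - low + 1 < num_rows:
                raise ValueError("range too small for the requested unique values")
            # Sampling without replacement guarantees no duplicates
            values = rng.sample(range(low, high + 1), num_rows)
        else:
            values = [rng.randint(low, high) for _ in range(num_rows)]
    # Replace values with None at roughly the observed null rate
    return [None if rng.random() < null_rate else v for v in values]
```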

Relationship-Aware Generation

python

# Based on inferred relationships:
# - foreign_key → Ensure referential integrity
# - functional_dependency → Maintain dependencies
# - one_to_many → Generate correct cardinality
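
The first two rules can be illustrated with the orders/customers and postal_code examples from the relationship output. Both functions are hypothetical sketches: child rows draw keys from the parent table, and dependent fields are looked up from the determinant rather than generated independently.

```python
import random


def generate_orders(customer_ids, num_orders, seed=None):
    """Each order references an existing customer id (referential integrity)."""
    rng = random.Random(seed)
    return [
        {"order_id": n, "customer_id": rng.choice(customer_ids)}
        for n in range(1, num_orders + 1)
    ]


def generate_addresses(postal_lookup, num_rows, seed=None):
    """Pick a postal code, then derive city and state from it so the
    functional dependency always holds."""
    rng = random.Random(seed)
    codes = sorted(postal_lookup)
    rows = []
    for _ in range(num_rows):
        code = rng.choice(codes)
        city, state = postal_lookup[code]
        rows.append({"postal_code": code, "city": city, "state": state})
    return rows
```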

Configuration

yaml

schema_inference:
  confidence_threshold: 0.8
  sample_size: 1000
  enable_semantic_detection: true
  enable_constraint_inference: true
  enable_relationship_detection: true
  semantic_type_patterns:
    email: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
    phone: "^\\+?1?\\d{9,15}$"
    uuid: "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
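
The document does not define how confidence_threshold is applied. One plausible reading, assumed here, is that a semantic type is accepted only when the fraction of sampled values matching its pattern reaches the threshold. The sketch mirrors a subset of the configuration as a Python dict, and detect_semantic_type is a hypothetical name.

```python
import re

CONFIG = {
    "confidence_threshold": 0.8,
    "sample_size": 1000,
    "semantic_type_patterns": {
        "email": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
        "phone": r"^\+?1?\d{9,15}$",
    },
}


def detect_semantic_type(values, config=CONFIG):
    """Return (semantic type or None, best match fraction) for a column."""
    sample = [str(v) for v in values if v is not None][: config["sample_size"]]
    if not sample:
        return None, 0.0
    best_type, best_score = None, 0.0
    for name, pattern in config["semantic_type_patterns"].items():
        regex = re.compile(pattern)
        score = sum(1 for v in sample if regex.fullmatch(v)) / len(sample)
        if score > best_score:
            best_type, best_score = name, score
    if best_score < config["confidence_threshold"]:
        return None, best_score  # below threshold: report the score only
    return best_type, best_score
```

Returning the score even when no type is accepted lets callers surface near misses for manual review.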

Best Practices

•Sample Size: Use a representative sample (at least 100 rows)
•Data Quality: Clean data before inference
•Domain Knowledge: Review and refine inferred schema
•Validation: Validate inferred constraints
•Iteration: Refine schema through multiple iterations
•Documentation: Document schema inference decisions
•Manual Override: Allow manual schema specification

Error Handling

•Ambiguous Types: Report confidence scores
•Inconsistent Data: Flag inconsistencies
•Missing Patterns: Use default fallback types
•Conflicting Constraints: Use least restrictive
•Invalid Relationships: Warn about weak relationships
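
The "least restrictive" rule for conflicting constraints could be sketched as a merge of two constraint dicts, for instance when two samples of the same field disagree. The function name and the choice to drop a bound that only one side reports are assumptions.

```python
def merge_constraints(a, b):
    """Combine two constraint dicts, keeping the least restrictive rule per key."""
    merged = {}
    # A bound survives only if both sides report one; otherwise it is dropped
    for key, pick in (("min", min), ("max", max),
                      ("min_length", min), ("max_length", max)):
        if key in a and key in b:
            merged[key] = pick(a[key], b[key])
    # Nullable if either side allows nulls; unique only if both sides agree
    merged["nullable"] = a.get("nullable", True) or b.get("nullable", True)
    merged["unique"] = a.get("unique", False) and b.get("unique", False)
    # The widest enum is the union of the observed values
    if "enum" in a and "enum" in b:
        merged["enum"] = sorted(set(a["enum"]) | set(b["enum"]))
    return merged
```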

Performance

•Sampling: Analyze a sample of large datasets (at most 10K rows)
•Caching: Cache inferred schemas
•Parallel: Analyze fields in parallel
•Incremental: Update schema with new data
•Progressive: Refine schema progressively
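
For inputs too large to load at once, a fixed-size uniform sample can be drawn in a single pass with reservoir sampling (Algorithm R). This is a sketch of one possible sampling strategy, not necessarily the one the skill uses; the function name is hypothetical.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            sample.append(row)  # fill the reservoir first
        else:
            # Row replaces a random slot with probability k / (index + 1)
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = row
    return sample
```

Calling reservoir_sample(row_iterator, 10_000) would cap the analysis at the 10K rows mentioned above without holding the full dataset in memory.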

Support

For schema inference issues:

•Check data quality and consistency
•Review confidence scores for ambiguous types
•Manually specify schema for complex cases
•Increase sample size for better inference
•Validate inferred schema before generation