Privacy Checker Skill

Description

PII (Personally Identifiable Information) detection, anonymization, and privacy compliance checking for synthetic data generation.

Purpose

This custom skill enhances the synthetic data generator with:

•PII detection in source and synthetic data
•Privacy risk assessment
•Data anonymization and pseudonymization
•Compliance checking (GDPR, HIPAA, CCPA)
•Data leakage prevention

When to Use

Use this skill when:

•Processing sensitive data patterns
•Ensuring privacy compliance
•Detecting PII in pattern files
•Validating that synthetic data does not leak real data
•Anonymizing sensitive fields
•Meeting regulatory requirements

Capabilities

1. PII Detection

•Direct Identifiers: Name, SSN, email, phone, address
•Quasi-Identifiers: Age, zip code, gender, birth date
•Sensitive Data: Medical records, financial data, biometrics
•Custom PII: Domain-specific sensitive fields
•Context-Aware: Uses both field names and field values to classify PII

2. Privacy Risk Assessment

•Re-identification Risk: Probability of re-identifying individuals
•K-Anonymity: Check k-anonymity levels
•L-Diversity: Check l-diversity for sensitive attributes
•T-Closeness: Check t-closeness for distribution similarity
•Differential Privacy: Calculate the privacy budget (epsilon) and the noise it requires

3. Anonymization Techniques

•Suppression: Remove PII fields
•Generalization: Replace with broader categories
•Pseudonymization: Replace with fake but consistent values
•Perturbation: Add statistical noise
•Tokenization: Replace with random tokens

4. Compliance Checking

•GDPR: EU data protection compliance
•HIPAA: Healthcare data privacy (US)
•CCPA: California consumer privacy
•SOC 2: Security and privacy controls
•PCI DSS: Payment card data security

5. Data Leakage Detection

•Exact Matches: Detect identical records
•Partial Matches: Detect similar records
•Statistical Leakage: Detect memorization
•Attribute Inference: Detect inferable attributes
•Membership Inference: Detect presence in training data

Usage Instructions

Step 1: Scan for PII

python

# Automatically scans during pattern analysis
pattern_analysis = await deep_analyze_pattern_tool({
    "file_path": "sensitive_data.csv",
    "analysis_depth": "comprehensive",
    "check_privacy": True  # Activates this skill
})

# Review PII findings
pii_report = pattern_analysis["privacy_assessment"]
if pii_report["contains_pii"]:
    print("⚠️  PII Detected:")
    for field, pii_type in pii_report["pii_fields"].items():
        print(f"  - {field}: {pii_type}")

Step 2: Anonymize Pattern Data

python

# Anonymize sensitive fields before analysis
anonymization_config = {
    "fields_to_anonymize": ["email", "phone", "ssn", "name"],
    "method": "pseudonymization",  # or "suppression", "generalization"
    "preserve_distributions": True
}

# Pattern analysis with anonymization
pattern_analysis = await deep_analyze_pattern_tool({
    "file_path": "sensitive_data.csv",
    "anonymize": anonymization_config
})

Step 3: Validate Privacy

python

# Check synthetic data for privacy violations
privacy_validation = await validate_quality_tool({
    "session_id": session_id,
    "original_data_path": "source_data.csv",
    "check_privacy": True,
    "privacy_checks": [
        "data_leakage",
        "pii_exposure",
        "k_anonymity",
        "l_diversity"
    ]
})

# Review privacy report
if privacy_validation["privacy"]["leakage_detected"]:
    print("❌ Privacy violation: Data leakage detected!")

PII Detection Patterns

Direct Identifiers

python

PII_PATTERNS = {
    "email": r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
    "phone": r'^\+?1?\d{9,15}$',
    "ssn": r'^\d{3}-\d{2}-\d{4}$',
    "credit_card": r'^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})$',  # Visa and Mastercard (51-55) only
    "passport": r'^[A-Z]{1,2}[0-9]{6,9}$',
    "drivers_license": r'^[A-Z]{1,2}[0-9]{5,8}$'
}

NAME_INDICATORS = [
    "name", "first_name", "last_name", "full_name",
    "fname", "lname", "surname", "given_name"
]

ADDRESS_INDICATORS = [
    "address", "street", "city", "state", "zip",
    "postal", "country", "location", "residence"
]
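
The patterns and indicators above can be combined into a per-column check. The following is a minimal sketch, not the skill's actual detector: `detect_pii` is a hypothetical helper, it uses only a subset of the patterns, and the 0.8 threshold is an assumption chosen to mirror `confidence_threshold` in the Configuration section.

```python
import re

# Subset of the patterns above, compiled once for reuse.
PII_PATTERNS = {
    "email": re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}
NAME_INDICATORS = ["name", "first_name", "last_name", "full_name"]


def detect_pii(field_name, values, threshold=0.8):
    """Return a PII type for the column, or None (hypothetical helper)."""
    # Value check: fraction of non-empty values that match a pattern.
    non_empty = [str(v).strip() for v in values if v not in (None, "")]
    if non_empty:
        for pii_type, pattern in PII_PATTERNS.items():
            hits = sum(1 for v in non_empty if pattern.match(v))
            if hits / len(non_empty) >= threshold:
                return pii_type
    # Field-name check: fall back to the name indicators.
    if any(ind in field_name.lower() for ind in NAME_INDICATORS):
        return "name"
    return None


columns = {
    "contact": ["a@example.com", "b@example.org", "c@example.net"],
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "score": [10, 20, 30],
}
findings = {col: detect_pii(col, vals) for col, vals in columns.items()}
print(findings)
```

Substring matching on field names is deliberately loose and will flag columns such as "filename"; a stricter check would compare whole tokens.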

Quasi-Identifiers

python

QUASI_IDENTIFIERS = {
    "age": "numeric_age",
    "birth_date": "date_of_birth",
    "gender": "gender",
    "zip_code": "postal_code",
    "race": "ethnicity",
    "occupation": "job_title"
}

Sensitive Attributes

python

SENSITIVE_FIELDS = {
    "medical": ["diagnosis", "medication", "medical_record", "health"],
    "financial": ["salary", "income", "account", "balance", "credit_score"],
    "biometric": ["fingerprint", "face", "iris", "voice", "dna"],
    "location": ["gps", "coordinates", "latitude", "longitude"]
}
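
One way to apply these keyword lists is to split each column name into tokens and require every token of a keyword to be present. This is a sketch under that assumption; `classify_sensitive` is a hypothetical helper and the source does not say how the skill matches field names.

```python
SENSITIVE_FIELDS = {
    "medical": ["diagnosis", "medication", "medical_record", "health"],
    "financial": ["salary", "income", "account", "balance", "credit_score"],
    "biometric": ["fingerprint", "face", "iris", "voice", "dna"],
    "location": ["gps", "coordinates", "latitude", "longitude"],
}


def classify_sensitive(column_names):
    """Map each column name to a sensitive category, or None."""
    result = {}
    for col in column_names:
        tokens = set(col.lower().replace("-", "_").split("_"))
        result[col] = None
        for category, keywords in SENSITIVE_FIELDS.items():
            # A keyword matches when all of its tokens appear in the column name.
            if any(set(kw.split("_")) <= tokens for kw in keywords):
                result[col] = category
                break
    return result


labels = classify_sensitive(
    ["patient_diagnosis", "annual_salary", "user_interface", "home_latitude"]
)
print(labels)
```

Token matching avoids flagging "user_interface" because of the substring "face", at the cost of missing unseparated names such as "patientdiagnosis".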

Anonymization Methods

1. Suppression

python

# Completely remove PII fields
def suppress_field(dataframe, field_name):
    return dataframe.drop(columns=[field_name])

2. Generalization

python

# Replace specific values with broader categories
def generalize_age(age):
    if age < 18:
        return "0-17"
    elif age < 30:
        return "18-29"
    elif age < 50:
        return "30-49"
    elif age < 65:
        return "50-64"
    else:
        return "65+"

def generalize_zipcode(zipcode):
    # Expects a 5-digit ZIP code string; keeps the first 3 digits
    # and replaces the last 2 with 00
    return zipcode[:3] + "00"

3. Pseudonymization

python

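import hashlib

# Replace with consistent fake values
def pseudonymize_email(email, seed):
    # Built-in hash() is salted per process for strings, so use SHA-256
    # to get the same fake email for the same input across runs
    digest = hashlib.sha256(f"{seed}:{email}".encode("utf-8")).hexdigest()
    return f"user{digest[:12]}@example.com"

def pseudonymize_name(name, faker_instance):
    # Seed Faker from the input so the same name always maps to the same fake name
    seed = int(hashlib.sha256(name.encode("utf-8")).hexdigest()[:8], 16)
    faker_instance.seed_instance(seed)
    return faker_instance.name()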

4. Perturbation

python

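import numpy as np
from datetime import timedelta

# Add statistical noise
def perturb_numeric(value, noise_level=0.1):
    # The standard deviation must be non-negative, so use the magnitude
    noise = np.random.normal(0, abs(value) * noise_level)
    return value + noise

def perturb_date(date, max_days=30):
    # randint excludes the upper bound; +1 keeps the range symmetric
    days_offset = int(np.random.randint(-max_days, max_days + 1))
    return date + timedelta(days=days_offset)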

5. Tokenization

python

import secrets

# Replace with random tokens
def tokenize_field(value, token_mapping):
    # Reuse the stored token so repeated values map to the same token
    if value not in token_mapping:
        token_mapping[value] = secrets.token_hex(16)
    return token_mapping[value]

Privacy Metrics

K-Anonymity

python

def calculate_k_anonymity(dataframe, quasi_identifiers):
    """
    Return k, the size of the smallest group of rows that share
    the same combination of quasi-identifier values
    """
    grouped = dataframe.groupby(quasi_identifiers).size()
    k = grouped.min()
    return k

# Good: k >= 5 (each combination appears at least 5 times)
# Poor: k < 3 (re-identification risk)
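
To see why generalization is the usual remedy for a low k, here is a dependency-free sketch that computes k on plain dictionaries before and after coarsening the quasi-identifiers. The rows are made-up illustration data, and `k_anonymity` is a stand-in for the pandas version above, not a replacement for it.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())


def generalize_age(age):
    # Coarse age bands, as in the Generalization section.
    if age < 18:
        return "0-17"
    if age < 30:
        return "18-29"
    if age < 50:
        return "30-49"
    if age < 65:
        return "50-64"
    return "65+"


rows = [
    {"age": 23, "zip": "94107"},
    {"age": 27, "zip": "94109"},
    {"age": 34, "zip": "94107"},
    {"age": 41, "zip": "94110"},
]
generalized = [
    {"age": generalize_age(r["age"]), "zip": r["zip"][:3] + "00"} for r in rows
]

k_raw = k_anonymity(rows, ["age", "zip"])
k_generalized = k_anonymity(generalized, ["age", "zip"])
print(k_raw, k_generalized)
```

Every raw row is unique, so k starts at 1; after banding ages and truncating ZIP codes the rows collapse into two groups of two.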

L-Diversity

python

def calculate_l_diversity(dataframe, quasi_identifiers, sensitive_attribute):
    """
    Return l, the smallest number of distinct sensitive values
    found in any quasi-identifier group
    """
    grouped = dataframe.groupby(quasi_identifiers)[sensitive_attribute]
    l = grouped.nunique().min()
    return l

# Good: l >= 3 (at least 3 distinct sensitive values per group)
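
A dataset can satisfy k-anonymity and still expose sensitive values when a group is homogeneous. The sketch below illustrates this with made-up rows and a dictionary-based `l_diversity` helper, which is an assumption for illustration rather than the skill's implementation.

```python
from collections import defaultdict


def l_diversity(rows, quasi_identifiers, sensitive_attribute):
    """Smallest number of distinct sensitive values in any group."""
    values = defaultdict(set)
    for r in rows:
        key = tuple(r[q] for q in quasi_identifiers)
        values[key].add(r[sensitive_attribute])
    return min(len(v) for v in values.values())


rows = [
    {"age_band": "18-29", "diagnosis": "flu"},
    {"age_band": "18-29", "diagnosis": "asthma"},
    {"age_band": "18-29", "diagnosis": "flu"},
    {"age_band": "30-49", "diagnosis": "diabetes"},
    {"age_band": "30-49", "diagnosis": "diabetes"},
]

l_value = l_diversity(rows, ["age_band"], "diagnosis")
print(l_value)
```

Both groups have at least two rows, so k is 2, but everyone in the "30-49" band shares one diagnosis, which makes l equal to 1.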

T-Closeness

python

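def calculate_t_closeness(dataframe, quasi_identifiers, sensitive_attribute):
    """
    Return the largest distance between a group's distribution of the
    sensitive attribute and the overall distribution.

    Assumes a categorical attribute with equal distance between categories,
    where the earth mover's distance reduces to half the L1 distance.
    """
    overall_dist = dataframe[sensitive_attribute].value_counts(normalize=True)
    max_distance = 0.0

    for _, group in dataframe.groupby(quasi_identifiers):
        group_dist = group[sensitive_attribute].value_counts(normalize=True)
        # Align categories; values absent from the group get probability 0
        group_dist = group_dist.reindex(overall_dist.index, fill_value=0.0)
        distance = 0.5 * (overall_dist - group_dist).abs().sum()
        max_distance = max(max_distance, distance)

    return max_distance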

# Good: t < 0.2 (close distribution similarity)
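
As a worked example of the distance behind t-closeness, the sketch below uses half the L1 distance between distributions, which equals the earth mover's distance only under the assumption that all categories are equally far apart. The rows are made-up data and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def distribution(values):
    """Relative frequency of each value."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


def t_closeness(rows, quasi_identifiers, sensitive_attribute):
    """Largest distance between any group's distribution and the overall one."""
    overall = distribution([r[sensitive_attribute] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].append(r[sensitive_attribute])

    worst = 0.0
    for values in groups.values():
        group = distribution(values)
        # Group categories are a subset of the overall categories.
        distance = 0.5 * sum(abs(p - group.get(c, 0.0)) for c, p in overall.items())
        worst = max(worst, distance)
    return worst


rows = [
    {"age_band": "18-29", "diagnosis": "flu"},
    {"age_band": "18-29", "diagnosis": "flu"},
    {"age_band": "30-49", "diagnosis": "flu"},
    {"age_band": "30-49", "diagnosis": "cold"},
]

t_value = t_closeness(rows, ["age_band"], "diagnosis")
print(t_value)
```

Overall the split is 75% flu and 25% cold; each band deviates from that by 0.25, so this dataset would fail a threshold of 0.2.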

Data Leakage Score

python

def calculate_leakage_score(synthetic_df, original_df):
    """
    Count synthetic records that exactly or partially match an
    original record. Assumes both frames share the same columns.
    """
    exact_matches = 0
    partial_matches = 0  # >80% field similarity; includes exact matches
    for _, syn_row in synthetic_df.iterrows():
        # Fraction of matching fields against every original row
        similarities = (original_df == syn_row).mean(axis=1)
        best = similarities.max()
        if best == 1.0:
            exact_matches += 1
        if best > 0.8:
            partial_matches += 1

    total = max(len(synthetic_df), 1)  # avoid division by zero
    leakage_score = {
        "exact_matches": exact_matches,
        "exact_match_rate": exact_matches / total,
        "partial_matches": partial_matches,
        "partial_match_rate": partial_matches / total
    }

    return leakage_score

# Good: exact_match_rate = 0, partial_match_rate < 0.01
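
The row-by-row comparison above costs one pass over the original data per synthetic record. For exact matches alone, a set lookup is much cheaper. The following is a sketch on plain dictionaries with made-up rows; `exact_match_rate` is a hypothetical helper and assumes every field value is hashable.

```python
def exact_match_rate(synthetic_rows, original_rows, fields):
    """Fraction of synthetic rows identical to some original row."""
    original_keys = {tuple(r[f] for f in fields) for r in original_rows}
    if not synthetic_rows:
        return 0.0
    matches = sum(
        1 for r in synthetic_rows if tuple(r[f] for f in fields) in original_keys
    )
    return matches / len(synthetic_rows)


original = [
    {"age": 34, "zip": "94107"},
    {"age": 41, "zip": "94110"},
]
synthetic = [
    {"age": 34, "zip": "94107"},  # copied from the original data
    {"age": 35, "zip": "94107"},
    {"age": 29, "zip": "94110"},
    {"age": 52, "zip": "94103"},
]

rate = exact_match_rate(synthetic, original, ["age", "zip"])
print(rate)
```

One of the four synthetic rows duplicates an original record, giving a rate of 0.25, far above the target of 0.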

Privacy Risk Assessment

Overall Privacy Score

python

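# Each component is a compliance score in [0, 1]; the weights sum to 1.0,
# so the weighted sum is already the weighted average
privacy_score = sum([
    no_pii_leakage * 0.40,
    k_anonymity_compliance * 0.25,
    no_data_leakage * 0.20,
    l_diversity_compliance * 0.10,
    t_closeness_compliance * 0.05
])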

# Pass criteria: privacy_score >= 0.90

Risk Levels

python

RISK_LEVELS = {
    "critical": {
        "score_range": [0, 0.5],
        "description": "Very high re-identification risk",
        "action": "Do not use - anonymize further"
    },
    "high": {
        "score_range": [0.5, 0.7],
        "description": "High re-identification risk",
        "action": "Review and improve anonymization"
    },
    "medium": {
        "score_range": [0.7, 0.85],
        "description": "Moderate re-identification risk",
        "action": "Acceptable with monitoring"
    },
    "low": {
        "score_range": [0.85, 1.0],
        "description": "Low re-identification risk",
        "action": "Safe to use"
    }
}
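
The weights and ranges above can be combined into a small scoring routine. This is a sketch, not the skill's own code: the function names are hypothetical, each component is assumed to be a score between 0 and 1, and because adjacent ranges share their boundary values, the lower bound is treated as inclusive here.

```python
WEIGHTS = {
    "no_pii_leakage": 0.40,
    "k_anonymity_compliance": 0.25,
    "no_data_leakage": 0.20,
    "l_diversity_compliance": 0.10,
    "t_closeness_compliance": 0.05,
}


def privacy_score(components):
    """Weighted sum of component scores, each expected in [0, 1]."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def risk_level(score):
    """Map a score to a risk level; lower bounds are inclusive (assumption)."""
    if score < 0.5:
        return "critical"
    if score < 0.7:
        return "high"
    if score < 0.85:
        return "medium"
    return "low"


components = {
    "no_pii_leakage": 1.0,
    "k_anonymity_compliance": 1.0,
    "no_data_leakage": 1.0,
    "l_diversity_compliance": 0.5,
    "t_closeness_compliance": 0.0,
}
score = privacy_score(components)
print(round(score, 2), risk_level(score))
```

Because PII leakage carries 40% of the weight, a dataset that fails only that check scores at most 0.60 and lands in the "high" band regardless of its other metrics.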

Compliance Checklists

GDPR Compliance

yaml

gdpr_requirements:
  - no_direct_identifiers: true
  - pseudonymization: true
  - right_to_be_forgotten: true  # Can delete source data
  - data_minimization: true
  - purpose_limitation: true
  - storage_limitation: true

HIPAA Compliance

yaml

hipaa_requirements:
  - remove_18_identifiers: true  # HIPAA Safe Harbor
  - expert_determination: false  # Set to true when using Expert Determination instead of Safe Harbor
  - de_identification_method: "safe_harbor"
  - covered_entities: ["PHI"]
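
Two of the Safe Harbor rules lend themselves to a short illustration: ages over 89 are aggregated into a single category, and ZIP codes are cut to their first three digits, or to "000" where the three-digit area has 20,000 or fewer residents. The sketch below covers only these two rules; the helper names are hypothetical and the prefix set is a placeholder, not an authoritative list.

```python
def safe_harbor_age(age):
    """Aggregate ages over 89 into a single "90+" category."""
    return "90+" if age > 89 else str(age)


def safe_harbor_zip(zipcode, low_population_prefixes):
    """Keep the first three ZIP digits, or "000" for low-population areas."""
    prefix = str(zipcode)[:3]
    return "000" if prefix in low_population_prefixes else prefix


# Placeholder for illustration only; real prefixes come from census data.
LOW_POPULATION_PREFIXES = {"999"}

print(safe_harbor_age(93))
print(safe_harbor_age(45))
print(safe_harbor_zip("94107", LOW_POPULATION_PREFIXES))
print(safe_harbor_zip("99901", LOW_POPULATION_PREFIXES))
```

The remaining identifiers, such as names, contact details, and full dates, still need suppression or generalization before a dataset qualifies.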

CCPA Compliance

yaml

ccpa_requirements:
  - no_personal_information: true
  - consumer_rights: true
  - opt_out_sale: true
  - disclosure_requirements: true

Integration with Generation

Privacy-Preserving Generation

python

# Generate with privacy preservation
generation_result = await generate_with_modes_tool({
    "requirements": requirements,
    "num_rows": 10000,
    "mode": "balanced",
    "privacy_preserving": True,  # Activates privacy protections
    "privacy_config": {
        "differential_privacy": True,
        "epsilon": 1.0,  # Privacy budget
        "k_anonymity": 5,
        "suppress_pii": True
    }
})
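
The source does not say which mechanism the generator uses when `differential_privacy` is enabled. As an illustration of what `epsilon` controls, the sketch below applies the Laplace mechanism to a count query, where the noise scale is the query's sensitivity divided by epsilon. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return a noisy count; smaller epsilon means more noise."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


rng = random.Random(42)  # fixed seed only to make the demo repeatable
for eps in (0.1, 1.0, 10.0):
    noisy = private_count(1000, eps, rng=rng)
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

A count changes by at most 1 when one person is added or removed, hence the default sensitivity of 1. With `epsilon` set to 1.0, the noise added to a count has a typical magnitude of about 1.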

Configuration

yaml

privacy_checker:
  pii_detection:
    enabled: true
    confidence_threshold: 0.8
    check_field_names: true
    check_field_values: true
  anonymization:
    default_method: "pseudonymization"
    preserve_distributions: true
  privacy_metrics:
    k_anonymity_threshold: 5
    l_diversity_threshold: 3
    t_closeness_threshold: 0.2
  compliance:
    check_gdpr: true
    check_hipaa: false
    check_ccpa: false

Best Practices

•Always Scan: Scan all source data for PII
•Anonymize First: Anonymize before analysis when possible
•Validate Privacy: Always run privacy validation
•Document PII: Document all PII handling
•Minimize Data: Collect only necessary fields
•Regular Audits: Audit privacy compliance regularly
•Legal Review: Consult legal for compliance questions

Example Privacy Report

json

{
  "privacy_score": 0.94,
  "risk_level": "low",
  "checks": {
    "pii_detection": {
      "pii_found": false,
      "fields_checked": 15,
      "sensitive_fields": []
    },
    "data_leakage": {
      "exact_matches": 0,
      "partial_matches": 0,
      "leakage_detected": false
    },
    "k_anonymity": {
      "k_value": 7,
      "threshold": 5,
      "passed": true
    },
    "l_diversity": {
      "l_value": 4,
      "threshold": 3,
      "passed": true
    }
  },
  "recommendations": [
    "Privacy standards met",
    "Safe for production use",
    "Continue monitoring for drift"
  ]
}

Error Handling

•PII Found: Warn and suggest anonymization
•Low K-Anonymity: Recommend generalization
•Data Leakage: Block export and regenerate
•Compliance Failure: Provide specific remediation steps
•Sensitive Data: Flag for manual review

Support

For privacy issues:

•Review PII detection report carefully
•Apply appropriate anonymization method
•Validate privacy metrics meet thresholds
•Consult legal for compliance questions
•Document all privacy decisions