data-lineage-mapper

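Use when documenting data sources and flow paths for AI capabilities in regulated environments. Recommended before a compliance review or audit. This skill produces end-to-end data lineage tracing that identifies data ownership, quality control checkpoints, and the corresponding regulatory mapping.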

SKILL.md
---
name: data-lineage-mapper
description: Use when documenting data sources and flows for AI capabilities in regulated environments. Use before compliance review or audit. Produces end-to-end lineage with ownership, quality checkpoints, and regulatory mapping.
---

Data Lineage Mapper

Overview

Document the complete data journey from source systems through transformations to AI capability output. The goal is traceability that supports compliance, audit, and operational resilience.

Core principle: If you can't trace data from source to decision, you can't defend that decision to regulators or auditors.

Output Format

```yaml
data_lineage_documentation:
  capability: "[AI Capability Name]"
  document_date: "[Date]"
  lineage_owner: "[Data Governance Role]"
  last_validated: "[Date]"
  next_review: "[Date]"

source_system_inventory:
  internal_sources:
    - source_id: "[Unique ID]"
      system: "[System Name]"
      description: "[What this system is]"
      data_elements:
        - element: "[Data element name]"
          description: "[What it contains]"
          sensitivity: "[Classification]"
          pii: true|false
      refresh:
        frequency: "[Batch/Real-time/On-demand]"
        schedule: "[If batch, when]"
        sla: "[Expected completion]"
      ownership:
        business_owner: "[Role/Team]"
        technical_owner: "[Role/Team]"
        data_steward: "[Role/Team]"
      data_quality:
        completeness: "[Measured %]"
        accuracy: "[How verified]"
        timeliness: "[Latency]"

  external_sources:
    - source_id: "[Unique ID]"
      provider: "[Vendor Name]"
      data_type: "[What data]"
      integration_method: "[API/Batch/etc.]"
      data_elements:
        - element: "[Element]"
          sensitivity: "[Classification]"
          pii: true|false
      refresh:
        frequency: "[Frequency]"
        sla: "[Expected latency]"
      contract:
        vendor_id: "[Internal vendor ID]"
        contract_expiry: "[Date]"
        data_usage_restrictions: "[Any restrictions]"
      contingency:
        backup_provider: "[Alternative if available]"
        switch_time: "[How long to switch]"
        impact_if_unavailable: "[What happens]"

data_flow_lineage:
  stage_1_extraction:
    description: "[What happens in this stage]"
    flows:
      - flow_id: "[Unique ID]"
        source: "[Source system ID]"
        destination: "[Where data goes]"
        method: "[How transferred]"
        schedule: "[When]"
        transformation: "[What changes - None if raw]"
        quality_check:
          - check: "[Check name]"
            threshold: "[Acceptance criteria]"
            failure_action: "[What happens on fail]"

  stage_2_transformation:
    description: "[Stage description]"
    flows:
      - flow_id: "[ID]"
        source: "[From]"
        destination: "[To]"
        transformation_logic:
          - step: "[Step name]"
            description: "[What happens]"
            business_rule: "[Rule reference]"
        quality_check:
          - check: "[Check]"
            threshold: "[Threshold]"

  stage_3_enrichment:
    description: "[External data integration]"
    flows:
      - flow_id: "[ID]"
        trigger: "[When enrichment happens]"
        process:
          - call: "[External source ID]"
            input: "[What's sent]"
            output: "[What's received]"
            timeout: "[Timeout]"
            retry: "[Retry policy]"
            fallback: "[If call fails]"

  stage_4_feature_engineering:
    description: "[ML feature creation]"
    flows:
      - flow_id: "[ID]"
        source: "[Input data]"
        destination: "[Feature store/output]"
        transformations:
          - feature: "[Feature name]"
            logic: "[Calculation/derivation]"
            source_fields: ["[Source field 1]", "[Source field 2]"]
        documentation: "[Link to feature dictionary]"

  stage_5_inference:
    description: "[Model scoring]"
    flows:
      - flow_id: "[ID]"
        source: "[Features]"
        destination: "[Output location]"
        model: "[Model name and version]"
        output:
          - field: "[Output field]"
            type: "[Data type]"
        audit_trail:
          - "[What's logged]"

critical_path_analysis:
  critical_sources:
    - source: "[Source ID]"
      criticality: "CRITICAL"
      rationale: "[Why critical]"
      impact_if_unavailable: "[Business impact]"
      fallback: "[What to do if unavailable]"
      rpo: "[Recovery point objective]"

  enrichment_sources:
    - source: "[Source ID]"
      criticality: "ENRICHMENT"
      rationale: "[Why enrichment vs critical]"
      impact_if_unavailable: "[Degraded state]"
      degradation_mode: "[How system operates without]"

  failure_scenarios:
    - scenario: "[Failure description]"
      detection: "[How detected]"
      response: "[What to do]"
      maximum_degradation: "[Time limit for degraded operation]"

compliance_mapping:
  data_residency:
    - data_category: "[Category]"
      requirement: "[Regulatory requirement]"
      current_state: "[How compliant]"
      evidence: "[Documentation]"

  retention_requirements:
    - data_element: "[Element or category]"
      requirement: "[Retention period]"
      regulation: "[Source regulation]"
      implementation: "[How implemented]"

  data_subject_rights:
    - right: "[Right name - access, deletion, etc.]"
      applicable_to: "[Which data subjects]"
      process: "[How fulfilled]"
      sla: "[Response time]"
      constraints: "[Any limitations]"

vendor_dependency_analysis:
  concentration_risk:
    - category: "[Data category]"
      primary_vendor: "[Vendor]"
      market_share: "[If known]"
      backup_vendor: "[Alternative]"
      switch_complexity: "[LOW/MEDIUM/HIGH]"

  vendor_risk_assessment:
    - vendor: "[Vendor name]"
      criticality: "[HIGH/MEDIUM/LOW]"
      financial_stability: "[Assessment]"
      data_security: "[Certifications]"
      contract_terms:
        - "[Key term 1]"
      contingency_tested: "[Last test date and result]"

lineage_metadata:
  documentation_refresh: "[Frequency]"
  change_log:
    - date: "[Date]"
      change: "[What changed]"
      approver: "[Who approved]"
```
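
A filled-in template can be checked mechanically before review. The sketch below is one possible approach, assuming the document has already been loaded into a Python dict (the YAML loader is not shown); the helper name `missing_sections` is hypothetical, and the section names come from the template above.

```python
# Top-level sections taken from the output format template.
REQUIRED_SECTIONS = (
    "data_lineage_documentation",
    "source_system_inventory",
    "data_flow_lineage",
    "critical_path_analysis",
    "compliance_mapping",
    "vendor_dependency_analysis",
    "lineage_metadata",
)


def missing_sections(doc: dict) -> list[str]:
    """Return required top-level sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not doc.get(name)]


# Hypothetical partially completed draft.
draft = {
    "data_lineage_documentation": {"capability": "Example capability"},
    "source_system_inventory": {"internal_sources": []},
    "data_flow_lineage": {},  # present but empty, so it is reported
}
print(missing_sections(draft))
```

An empty section is treated the same as a missing one, since a heading with no content gives a reviewer nothing to trace.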

Source System Documentation

Internal Sources

For each internal source, capture:

| Attribute | Why It Matters |
| --- | --- |
| System name and ID | Traceability |
| Data elements | Know exactly what flows |
| Sensitivity/PII | Drives handling requirements |
| Refresh frequency | Understand timeliness |
| SLA | Set expectations |
| Ownership (3 levels) | Accountability |
| Quality metrics | Trust assessment |

Ownership hierarchy:

  • Business owner: Accountable for data correctness
  • Technical owner: Accountable for system availability
  • Data steward: Day-to-day data quality management
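
The three-level ownership model lends itself to a simple completeness check. This is a minimal sketch, assuming sources are dicts shaped like the `internal_sources` entries in the template; the function name and the sample team names are hypothetical.

```python
OWNERSHIP_ROLES = ("business_owner", "technical_owner", "data_steward")


def ownership_gaps(sources: list[dict]) -> dict[str, list[str]]:
    """Map each source_id to the ownership roles it is missing."""
    gaps = {}
    for source in sources:
        ownership = source.get("ownership") or {}
        missing = [role for role in OWNERSHIP_ROLES if not ownership.get(role)]
        if missing:
            gaps[source.get("source_id", "<no id>")] = missing
    return gaps


# Hypothetical inventory: SRC-002 names only a business owner.
sources = [
    {
        "source_id": "SRC-001",
        "ownership": {
            "business_owner": "Payments Operations",
            "technical_owner": "Core Platform Team",
            "data_steward": "Payments Data Office",
        },
    },
    {"source_id": "SRC-002", "ownership": {"business_owner": "Client Onboarding"}},
]
print(ownership_gaps(sources))
```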

External Sources

External vendors require additional documentation:

| Attribute | Why It Matters |
| --- | --- |
| Vendor ID | Link to vendor management |
| Contract expiry | Renewal planning |
| Usage restrictions | Compliance |
| Backup provider | Resilience |
| Switch time | Contingency planning |
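
Contract expiry is easier to plan around when it is checked on a schedule. The following sketch flags external sources whose contract ends within a planning horizon. It assumes `contract_expiry` is recorded as an ISO 8601 date string, and the 180-day default is an illustrative choice, not a recommendation from this skill.

```python
from datetime import date, timedelta


def expiring_contracts(sources: list[dict], today: date, horizon_days: int = 180) -> list[str]:
    """Return source_ids whose contract expires on or before today + horizon."""
    cutoff = today + timedelta(days=horizon_days)
    flagged = []
    for source in sources:
        # Assumption: contract_expiry is an ISO 8601 date such as "2025-03-31".
        expiry = date.fromisoformat(source["contract"]["contract_expiry"])
        if expiry <= cutoff:
            flagged.append(source["source_id"])
    return flagged


# Hypothetical external sources.
external = [
    {"source_id": "EXT-001", "contract": {"contract_expiry": "2025-03-31"}},
    {"source_id": "EXT-002", "contract": {"contract_expiry": "2026-12-31"}},
]
print(expiring_contracts(external, today=date(2025, 1, 1)))
```

Passing `today` explicitly keeps the check reproducible, which matters when the result is attached to an audit record.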

Data Flow Stages

Stage 1: Extraction

Raw data movement from source to landing zone.

Key documentation:

  • Exact source and destination
  • Transfer method (batch, CDC, API)
  • Schedule and SLA
  • Quality checks at landing

Stage 2: Transformation

Data cleansing, standardization, and enrichment from internal data (third-party enrichment is covered in Stage 3).

Key documentation:

  • Transformation logic (specific rules)
  • Business rule references
  • Quality checks post-transformation
  • Data loss/filtering (and why)

Stage 3: External Enrichment

Integration of third-party data.

Key documentation:

  • What triggers enrichment
  • Input/output for each call
  • Timeout and retry policies
  • Fallback if unavailable
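
To make the retry and fallback items concrete, here is a minimal sketch of an enrichment call wrapper. It assumes the vendor call enforces its own timeout and raises an exception on failure; the function names, retry count, and backoff values are hypothetical.

```python
import time


def enrich_with_fallback(call, payload, retries=2, backoff_seconds=0.0, fallback=None):
    """Call a vendor with retries; on exhaustion use the documented fallback.

    Returns (result, origin), where origin is "vendor" or "fallback", so the
    audit trail can record which path produced the value.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            # Assumption: `call` applies its own timeout and raises on failure.
            return call(payload), "vendor"
        except Exception as exc:  # deliberately broad in this sketch
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    if fallback is None:
        raise RuntimeError("enrichment failed and no fallback is documented") from last_error
    return fallback(payload), "fallback"


def always_down(payload):
    """Stand-in for an unavailable vendor."""
    raise TimeoutError("simulated vendor timeout")


result, origin = enrich_with_fallback(
    always_down, {"client_id": "C-1"}, fallback=lambda p: {"score": None}
)
print(origin)  # fallback
```

Returning the origin alongside the result lets downstream stages record whether a decision used vendor data or a degraded default.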

Stage 4: Feature Engineering

Derived features for ML models.

Key documentation:

  • Feature calculation logic
  • Source fields for each feature
  • Link to feature dictionary
  • Versioning approach
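
Recording source fields per feature makes impact analysis a lookup. The sketch below, using hypothetical feature and field names, answers the question "which features change if this source field changes?".

```python
def features_affected_by(field: str, features: list[dict]) -> list[str]:
    """Return names of features that derive from the given source field."""
    return sorted(f["feature"] for f in features if field in f["source_fields"])


# Hypothetical feature dictionary entries, shaped like the template's
# stage_4_feature_engineering transformations.
features = [
    {
        "feature": "txn_count_30d",
        "logic": "count of transactions in the trailing 30 days",
        "source_fields": ["transactions.txn_id", "transactions.txn_date"],
    },
    {
        "feature": "avg_txn_amount_30d",
        "logic": "mean transaction amount in the trailing 30 days",
        "source_fields": ["transactions.amount", "transactions.txn_date"],
    },
]
print(features_affected_by("transactions.amount", features))
```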

Stage 5: Inference

Model scoring and output.

Key documentation:

  • Model version used
  • Output fields and types
  • Audit trail captured
  • Downstream consumers
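
One way to capture the audit trail is a small record written at scoring time. This is a sketch under assumptions: the field names, the decision ID format, and the choice to store a SHA-256 hash of the inputs (rather than the inputs themselves) are illustrative, not prescribed by this skill.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_record(decision_id, model, features, output, scored_at):
    """Build an audit entry tying an output to a model version and its inputs."""
    # Canonical JSON so the same inputs always hash to the same value.
    canonical = json.dumps(features, sort_keys=True, separators=(",", ":"))
    return {
        "decision_id": decision_id,
        "model": model,  # model name and version, as in the template
        "scored_at": scored_at.isoformat(),
        "feature_hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "output": output,
    }


# Hypothetical decision.
record = build_audit_record(
    decision_id="D-0001",
    model="example-risk-model v1.2.0",
    features={"txn_count_30d": 14, "avg_txn_amount_30d": 82.5},
    output={"risk_score": 0.31},
    scored_at=datetime(2025, 1, 6, 9, 30, tzinfo=timezone.utc),
)
print(record["scored_at"], record["feature_hash"][:12])
```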

Quality Checkpoint Design

Every stage transition should have quality checks:

```yaml
quality_checkpoint:
  location: "[Between Stage X and Y]"
  checks:
    - check: "Completeness"
      metric: "% records with all required fields"
      threshold: ">99%"
      action_if_fail: "Alert + manual review"

    - check: "Referential integrity"
      metric: "% records matching master data"
      threshold: "100%"
      action_if_fail: "Reject record + log"

    - check: "Business rule validation"
      metric: "% records passing rule X"
      threshold: ">99.5%"
      action_if_fail: "Route to exception queue"

    - check: "Timeliness"
      metric: "Data age at checkpoint"
      threshold: "<4 hours"
      action_if_fail: "Alert operations"
```
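
Threshold strings like those in the checkpoint above can be evaluated directly. The sketch below is one possible parser. It assumes the observed value is supplied in the same unit as the threshold, and it treats a bare number such as "100%" as a minimum; both are assumptions of this example.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}
# Optional comparison symbol followed by a number; any trailing unit is ignored.
_PATTERN = re.compile(r"\s*(>=|<=|>|<)?\s*(\d+(?:\.\d+)?)")


def passes(threshold: str, observed: float) -> bool:
    """Return True if the observed value satisfies the threshold string."""
    match = _PATTERN.match(threshold)
    if match is None:
        raise ValueError(f"unrecognized threshold: {threshold!r}")
    # Assumption: a bare number (e.g. "100%") means "at least this value".
    symbol = match.group(1) or ">="
    limit = float(match.group(2))
    return _OPS[symbol](observed, limit)


print(passes(">99%", 99.4))      # True
print(passes("100%", 99.9))      # False
print(passes("<4 hours", 4.0))   # False
```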

Critical Path Analysis

Criticality Classification

| Level | Definition | Example |
| --- | --- | --- |
| CRITICAL | System cannot function without | Core transaction data |
| IMPORTANT | Significant degradation without | Enrichment data |
| ENRICHMENT | Nice to have, operates without | Supplemental analytics |

Failure Mode Documentation

For each critical source:

  1. What fails: Specific failure scenario
  2. How detected: Monitoring/alerting
  3. Response: Immediate action
  4. Degradation: How long the system can operate in degraded mode
  5. Fallback: Alternative data source or process
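
The degradation limit from the list above can be tracked during an incident. This is a minimal sketch; the status labels `DEGRADED_OK` and `ESCALATE` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta


def degradation_status(outage_start: datetime, now: datetime, maximum_degradation: timedelta):
    """Return (status, time remaining) for a source running in degraded mode."""
    remaining = maximum_degradation - (now - outage_start)
    status = "DEGRADED_OK" if remaining > timedelta(0) else "ESCALATE"
    return status, remaining


# Hypothetical outage that began at 08:00 with a 4-hour degradation limit.
start = datetime(2025, 1, 6, 8, 0)
print(degradation_status(start, datetime(2025, 1, 6, 10, 30), timedelta(hours=4)))
```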

Compliance Mapping

Data Residency

```yaml
data_residency_map:
  - data_category: "Canadian client PII"
    regulation: "PIPEDA + internal residency policy"
    requirement: "Stored and processed in Canada (policy); comparable protection if transferred (PIPEDA)"
    current_state: "Canadian data center"
    evidence: "Infrastructure topology doc"
    risk_if_violated: "Regulatory action, client trust"

  - data_category: "EU client data"
    regulation: "GDPR"
    requirement: "EU residency or adequate protection"
    current_state: "EU data center + SCCs for US access"
    evidence: "DPA, SCC documentation"
```

Retention Requirements

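| Data Element | Retention | Regulation | Notes |
| --- | --- | --- | --- |
| Transaction records | 5 years minimum | BSA/AML | From transaction date; many firms retain 7 years by policy |
| Client communications | 6 years | FINRA 4511 | From creation |
| Model decisions | 7 years | Internal policy | Supports examination |
| Raw source data | Per source | Varies | May differ from derived |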
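
Retention periods translate into a purge-eligibility date. The sketch below computes that date from a start date and a period in whole years; the function names are hypothetical, and rolling a February 29 start forward to March 1 is an assumption of this example rather than a regulatory rule.

```python
from datetime import date


def retention_end(start: date, years: int) -> date:
    """Return the first date on which the retention period has elapsed."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Feb 29 start with a non-leap end year: roll forward to Mar 1.
        return date(start.year + years, 3, 1)


def may_purge(start: date, years: int, today: date, legal_hold: bool = False) -> bool:
    """A record may be purged only after retention ends and with no legal hold."""
    return not legal_hold and today >= retention_end(start, years)


# A model decision created on 2019-06-15 under a 7-year internal policy.
print(retention_end(date(2019, 6, 15), 7))  # 2026-06-15
```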

Data Subject Rights

| Right | Process | SLA | Constraints |
| --- | --- | --- | --- |
| Access | Export from curated zone | 30 days | Format per request |
| Rectification | Update in source system | 30 days | Audit trail maintained |
| Erasure | Soft delete, then hard | 30 days | Subject to legal holds |
| Portability | Machine-readable export | 30 days | Standard format |
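
The erasure row combines a two-step delete with a legal-hold constraint, which can be sketched as a small decision function. The record fields `legal_hold` and `soft_deleted` and the action labels are hypothetical; the 30-day default mirrors the SLA column above.

```python
from datetime import date, timedelta


def response_due(received: date, sla_days: int = 30) -> date:
    """Date by which a data subject request must be answered."""
    return received + timedelta(days=sla_days)


def erasure_action(record: dict) -> str:
    """Decide the next erasure step for one record."""
    if record.get("legal_hold"):
        return "RETAIN_LEGAL_HOLD"  # holds take precedence over erasure
    if record.get("soft_deleted"):
        return "HARD_DELETE"        # second step of the two-step delete
    return "SOFT_DELETE"            # first step


print(response_due(date(2025, 3, 1)))  # 2025-03-31
print(erasure_action({"legal_hold": True, "soft_deleted": True}))  # RETAIN_LEGAL_HOLD
```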

Vendor Dependency Analysis

Concentration Risk

Document single points of failure:

```yaml
concentration_analysis:
  - capability: "Sanctions screening"
    primary: "Dow Jones"
    backup: "LexisNexis"
    switching_cost: "MEDIUM - 2-4 weeks integration"
    risk_accepted: "Yes - tested annually"

  - capability: "Market data"
    primary: "Bloomberg"
    backup: "Refinitiv (partial coverage)"
    switching_cost: "HIGH - 3-6 months"
    risk_accepted: "Yes - market standard concentration"
```
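
Entries shaped like the concentration analysis above can be scanned for single points of failure. This sketch reports two things: vendors that are primary for more than one capability, and capabilities with no backup. The vendor names in the example are placeholders.

```python
from collections import defaultdict


def concentration_findings(entries: list[dict]) -> dict:
    """Summarize shared primary vendors and capabilities lacking a backup."""
    by_vendor = defaultdict(list)
    for entry in entries:
        by_vendor[entry["primary"]].append(entry["capability"])
    shared = {vendor: caps for vendor, caps in by_vendor.items() if len(caps) > 1}
    no_backup = [e["capability"] for e in entries if not e.get("backup")]
    return {"shared_primary": shared, "no_backup": no_backup}


# Hypothetical entries with placeholder vendor names.
entries = [
    {"capability": "Sanctions screening", "primary": "Vendor A", "backup": "Vendor B"},
    {"capability": "Adverse media", "primary": "Vendor A", "backup": None},
    {"capability": "Market data", "primary": "Vendor C", "backup": "Vendor D"},
]
print(concentration_findings(entries))
```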

Vendor Contingency Testing

| Vendor | Last Test | Test Type | Result |
| --- | --- | --- | --- |
| DJ Risk | 2025-Q3 | Failover to backup | Success |
| Bureau | 2025-Q2 | Manual fallback | Success (4hr SLA) |

Common Mistakes

| Mistake | Why It's Wrong | Do This Instead |
| --- | --- | --- |
| Only major systems | Misses intermediate data stores | Document every touchpoint |
| No ownership | Unclear accountability | Three-level ownership model |
| "Transformations applied" | Not traceable | Document specific logic |
| Generic quality checks | Not actionable | Specific thresholds and actions |
| No criticality assessment | Can't prioritize resilience | Classify every source |
| Vendor risk ignored | Concentration risk is real | Document and test contingencies |

Red Flags in Your Lineage

If your lineage documentation shows any of the following, it is incomplete:

  • Sources listed without ownership
  • "Data is transformed" without specific logic
  • No quality checkpoints between stages
  • Missing external vendor analysis
  • No failure scenarios documented
  • Retention requirements not mapped
  • No evidence of contingency testing
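
Two of the red flags above can be detected automatically: vague transformation text and flows with no quality checkpoint. The sketch below assumes flows are dicts shaped like the template's stage entries; the phrase list is a hypothetical starting point, not an exhaustive rule set.

```python
# Phrases that describe a transformation without saying what it does.
VAGUE_PHRASES = ("data is transformed", "transformations applied")


def red_flags(flows: list[dict]) -> list[str]:
    """Return human-readable findings for vague or unchecked flows."""
    findings = []
    for flow in flows:
        flow_id = flow.get("flow_id", "<no id>")
        text = str(flow.get("transformation", "")).strip().lower()
        if any(phrase in text for phrase in VAGUE_PHRASES):
            findings.append(f"{flow_id}: transformation described without specific logic")
        if not flow.get("quality_check"):
            findings.append(f"{flow_id}: no quality checkpoint")
    return findings


# Hypothetical flows.
flows = [
    {
        "flow_id": "F-1",
        "transformation": "Transformations applied",
        "quality_check": [{"check": "Completeness", "threshold": ">99%"}],
    },
    {"flow_id": "F-2", "transformation": "Trim whitespace; map country to ISO 3166-1 alpha-2"},
]
for finding in red_flags(flows):
    print(finding)
```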

Financial Services Context

Data lineage for financial services AI requires:

Regulatory Traceability

  • Regulators ask "where does this data come from?"
  • Must trace any decision back to source
  • Especially important for AML, KYC, suitability
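
Tracing a decision back to its sources is a graph walk over the documented flows. The sketch below assumes each flow records a `source` and a `destination`, as in the template, and walks upstream until it reaches nodes with no documented inputs. The node names are hypothetical.

```python
def upstream_sources(node: str, flows: list[dict]) -> list[str]:
    """Return the origin systems that feed the given node, directly or indirectly."""
    parents = {}
    for flow in flows:
        parents.setdefault(flow["destination"], set()).add(flow["source"])

    seen, stack, roots = set(), [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the documented flows
        seen.add(current)
        inputs = parents.get(current)
        if inputs:
            stack.extend(inputs)
        else:
            roots.add(current)  # nothing feeds this node, so it is an origin
    return sorted(roots)


# Hypothetical five-stage pipeline with one internal and one external source.
flows = [
    {"source": "SRC-001", "destination": "landing_zone"},
    {"source": "landing_zone", "destination": "curated_zone"},
    {"source": "EXT-001", "destination": "curated_zone"},
    {"source": "curated_zone", "destination": "feature_store"},
    {"source": "feature_store", "destination": "decision_output"},
]
print(upstream_sources("decision_output", flows))  # ['EXT-001', 'SRC-001']
```

If a walk ends at a node that is not a registered source system, the lineage has a gap at that point.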

Multi-Jurisdiction Awareness

  • Data residency varies by client location
  • Cross-border flows need documentation
  • Privacy rights vary (GDPR, PIPEDA, CCPA)

Vendor Risk Management

  • External data is critical for many AI capabilities
  • Concentration risk must be documented
  • Contingency plans must be tested

Audit Readiness

  • Auditors will trace sample decisions
  • Documentation must enable that trace
  • Quality checkpoint evidence must exist