Data Lineage Mapper
Overview
Document the complete data journey from source systems through transformations to AI capability output. The goal is traceability that supports compliance, audit, and operational resilience.
Core principle: If you can't trace data from source to decision, you can't defend that decision to regulators or auditors.
Output Format
data_lineage_documentation:
  capability: "[AI Capability Name]"
  document_date: "[Date]"
  lineage_owner: "[Data Governance Role]"
  last_validated: "[Date]"
  next_review: "[Date]"
source_system_inventory:
  internal_sources:
    - source_id: "[Unique ID]"
      system: "[System Name]"
      description: "[What this system is]"
      data_elements:
        - element: "[Data element name]"
          description: "[What it contains]"
          sensitivity: "[Classification]"
          pii: true|false
      refresh:
        frequency: "[Batch/Real-time/On-demand]"
        schedule: "[If batch, when]"
        sla: "[Expected completion]"
      ownership:
        business_owner: "[Role/Team]"
        technical_owner: "[Role/Team]"
        data_steward: "[Role/Team]"
      data_quality:
        completeness: "[Measured %]"
        accuracy: "[How verified]"
        timeliness: "[Latency]"
  external_sources:
    - source_id: "[Unique ID]"
      provider: "[Vendor Name]"
      data_type: "[What data]"
      integration_method: "[API/Batch/etc.]"
      data_elements:
        - element: "[Element]"
          sensitivity: "[Classification]"
          pii: true|false
      refresh:
        frequency: "[Frequency]"
        sla: "[Expected latency]"
      contract:
        vendor_id: "[Internal vendor ID]"
        contract_expiry: "[Date]"
        data_usage_restrictions: "[Any restrictions]"
      contingency:
        backup_provider: "[Alternative if available]"
        switch_time: "[How long to switch]"
        impact_if_unavailable: "[What happens]"
data_flow_lineage:
  stage_1_extraction:
    description: "[What happens in this stage]"
    flows:
      - flow_id: "[Unique ID]"
        source: "[Source system ID]"
        destination: "[Where data goes]"
        method: "[How transferred]"
        schedule: "[When]"
        transformation: "[What changes - None if raw]"
        quality_check:
          - check: "[Check name]"
            threshold: "[Acceptance criteria]"
            failure_action: "[What happens on fail]"
  stage_2_transformation:
    description: "[Stage description]"
    flows:
      - flow_id: "[ID]"
        source: "[From]"
        destination: "[To]"
        transformation_logic:
          - step: "[Step name]"
            description: "[What happens]"
            business_rule: "[Rule reference]"
        quality_check:
          - check: "[Check]"
            threshold: "[Threshold]"
  stage_3_enrichment:
    description: "[External data integration]"
    flows:
      - flow_id: "[ID]"
        trigger: "[When enrichment happens]"
        process:
          - call: "[External source ID]"
            input: "[What's sent]"
            output: "[What's received]"
            timeout: "[Timeout]"
            retry: "[Retry policy]"
            fallback: "[If call fails]"
  stage_4_feature_engineering:
    description: "[ML feature creation]"
    flows:
      - flow_id: "[ID]"
        source: "[Input data]"
        destination: "[Feature store/output]"
        transformations:
          - feature: "[Feature name]"
            logic: "[Calculation/derivation]"
            source_fields: ["[Source field 1]", "[Source field 2]"]
        documentation: "[Link to feature dictionary]"
  stage_5_inference:
    description: "[Model scoring]"
    flows:
      - flow_id: "[ID]"
        source: "[Features]"
        destination: "[Output location]"
        model: "[Model name and version]"
        output:
          - field: "[Output field]"
            type: "[Data type]"
        audit_trail:
          - "[What's logged]"
critical_path_analysis:
  critical_sources:
    - source: "[Source ID]"
      criticality: "CRITICAL"
      rationale: "[Why critical]"
      impact_if_unavailable: "[Business impact]"
      fallback: "[What to do if unavailable]"
      rpo: "[Recovery point objective]"
  enrichment_sources:
    - source: "[Source ID]"
      criticality: "ENRICHMENT"
      rationale: "[Why enrichment vs critical]"
      impact_if_unavailable: "[Degraded state]"
      degradation_mode: "[How system operates without]"
  failure_scenarios:
    - scenario: "[Failure description]"
      detection: "[How detected]"
      response: "[What to do]"
      maximum_degradation: "[Time limit for degraded operation]"
compliance_mapping:
  data_residency:
    - data_category: "[Category]"
      requirement: "[Regulatory requirement]"
      current_state: "[How compliant]"
      evidence: "[Documentation]"
  retention_requirements:
    - data_element: "[Element or category]"
      requirement: "[Retention period]"
      regulation: "[Source regulation]"
      implementation: "[How implemented]"
  data_subject_rights:
    - right: "[Right name - access, deletion, etc.]"
      applicable_to: "[Which data subjects]"
      process: "[How fulfilled]"
      sla: "[Response time]"
      constraints: "[Any limitations]"
vendor_dependency_analysis:
  concentration_risk:
    - category: "[Data category]"
      primary_vendor: "[Vendor]"
      market_share: "[If known]"
      backup_vendor: "[Alternative]"
      switch_complexity: "[LOW/MEDIUM/HIGH]"
  vendor_risk_assessment:
    - vendor: "[Vendor name]"
      criticality: "[HIGH/MEDIUM/LOW]"
      financial_stability: "[Assessment]"
      data_security: "[Certifications]"
      contract_terms:
        - "[Key term 1]"
      contingency_tested: "[Last test date and result]"
lineage_metadata:
  documentation_refresh: "[Frequency]"
  change_log:
    - date: "[Date]"
      change: "[What changed]"
      approver: "[Who approved]"
Source System Documentation
Internal Sources
For each internal source, capture:
| Attribute | Why It Matters |
|---|---|
| System name and ID | Traceability |
| Data elements | Know exactly what flows |
| Sensitivity/PII | Drives handling requirements |
| Refresh frequency | Understand timeliness |
| SLA | Set expectations |
| Ownership (3 levels) | Accountability |
| Quality metrics | Trust assessment |
Ownership hierarchy:
- Business owner: Accountable for data correctness
- Technical owner: Accountable for system availability
- Data steward: Day-to-day data quality management
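The three-level ownership model is easy to check mechanically. The following is a minimal sketch, assuming each source record has been loaded into a dictionary shaped like the `ownership` block of the template; `missing_owners` and the sample record are hypothetical and only illustrate the idea.

```python
REQUIRED_OWNER_ROLES = ("business_owner", "technical_owner", "data_steward")


def missing_owners(source: dict) -> list[str]:
    """Return the ownership roles that are absent or blank for one source."""
    ownership = source.get("ownership") or {}
    return [
        role
        for role in REQUIRED_OWNER_ROLES
        if not str(ownership.get(role) or "").strip()
    ]


# Illustrative record; the ID and team name are made up.
example_source = {
    "source_id": "SRC-001",
    "ownership": {"business_owner": "Retail Ops", "technical_owner": ""},
}
print(missing_owners(example_source))  # ['technical_owner', 'data_steward']
```

A blank string is treated the same as a missing key, since a placeholder left unfilled gives no more accountability than no entry at all.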
External Sources
External vendors require additional documentation:
| Attribute | Why It Matters |
|---|---|
| Vendor ID | Link to vendor management |
| Contract expiry | Renewal planning |
| Usage restrictions | Compliance |
| Backup provider | Resilience |
| Switch time | Contingency planning |
Data Flow Stages
Stage 1: Extraction
Raw data movement from source to landing zone.
Key documentation:
- Exact source and destination
- Transfer method (batch, CDC, API)
- Schedule and SLA
- Quality checks at landing
Stage 2: Transformation
Data cleansing, standardization, and application of business rules.
Key documentation:
- Transformation logic (specific rules)
- Business rule references
- Quality checks post-transformation
- Data loss/filtering (and why)
Stage 3: External Enrichment
Integration of third-party data.
Key documentation:
- What triggers enrichment
- Input/output for each call
- Timeout and retry policies
- Fallback if unavailable
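To make the timeout, retry, and fallback items concrete, here is one possible shape for an enrichment wrapper. It is a sketch under stated assumptions: the vendor call is passed in as a plain function that enforces its own timeout and raises `TimeoutError` or `ConnectionError` on failure, and `enrich_with_fallback` and its parameters are hypothetical names, not part of any vendor SDK.

```python
import time
from typing import Callable


def enrich_with_fallback(
    call: Callable[[dict], dict],
    payload: dict,
    fallback: dict,
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> tuple[dict, str]:
    """Try an enrichment call a bounded number of times, then fall back.

    Returns the result plus a status string suitable for the audit trail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call(payload), f"enriched (attempt {attempt})"
        except (TimeoutError, ConnectionError):
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)  # linear backoff
    return fallback, f"fallback after {max_attempts} failed attempts"


def always_times_out(payload: dict) -> dict:
    """Stand-in for a vendor call that never responds."""
    raise TimeoutError("vendor did not respond")


result, status = enrich_with_fallback(
    always_times_out, {"client_id": "C-1"}, fallback={"risk_score": None}
)
print(status)  # fallback after 3 failed attempts
```

Returning the status alongside the result means the lineage record can show whether a given decision used live enrichment or the fallback value.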
Stage 4: Feature Engineering
Derived features for ML models.
Key documentation:
- Feature calculation logic
- Source fields for each feature
- Link to feature dictionary
- Versioning approach
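Recording source fields per feature pays off when a source changes: the same records answer "which features does this field affect?". A minimal sketch, assuming feature entries shaped like the `feature` / `source_fields` pairs in the template; the feature and field names below are invented for illustration.

```python
# Hypothetical feature dictionary entries.
FEATURES = [
    {"feature": "txn_count_30d",
     "source_fields": ["transactions.txn_id", "transactions.txn_date"]},
    {"feature": "avg_balance_90d",
     "source_fields": ["accounts.balance", "accounts.as_of_date"]},
    {"feature": "txn_to_balance_ratio",
     "source_fields": ["transactions.amount", "accounts.balance"]},
]


def features_using(field: str, features: list[dict]) -> list[str]:
    """Impact analysis: which features are derived from a given source field?"""
    return [f["feature"] for f in features if field in f["source_fields"]]


print(features_using("accounts.balance", FEATURES))
# ['avg_balance_90d', 'txn_to_balance_ratio']
```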
Stage 5: Inference
Model scoring and output.
Key documentation:
- Model version used
- Output fields and types
- Audit trail captured
- Downstream consumers
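Once every stage records a source and destination for each flow, tracing an output back to its origins becomes a graph walk. The sketch below assumes the flows from all five stages have been collected as `(flow_id, source, destination)` tuples; the node names and `trace_upstream` are hypothetical.

```python
from collections import defaultdict

# Hypothetical flows, one tuple per documented flow.
FLOWS = [
    ("F1", "SRC-CORE", "landing.transactions"),
    ("F2", "landing.transactions", "curated.transactions"),
    ("F3", "EXT-SANCTIONS", "curated.transactions"),
    ("F4", "curated.transactions", "features.txn_profile"),
    ("F5", "features.txn_profile", "scores.aml_alert"),
]


def trace_upstream(node: str, flows: list[tuple[str, str, str]]) -> set[str]:
    """Return every node that feeds `node`, directly or indirectly."""
    parents = defaultdict(set)
    for _flow_id, source, destination in flows:
        parents[destination].add(source)

    seen: set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


for name in sorted(trace_upstream("scores.aml_alert", FLOWS)):
    print(name)
```

If a flow is missing from the documentation, the walk stops short of the true source, which is a quick way to spot undocumented intermediate stores.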
Quality Checkpoint Design
Every stage transition should have quality checks:
quality_checkpoint:
  location: "[Between Stage X and Y]"
  checks:
    - check: "Completeness"
      metric: "% records with all required fields"
      threshold: ">99%"
      action_if_fail: "Alert + manual review"
    - check: "Referential integrity"
      metric: "% records matching master data"
      threshold: "100%"
      action_if_fail: "Reject record + log"
    - check: "Business rule validation"
      metric: "% records passing rule X"
      threshold: ">99.5%"
      action_if_fail: "Route to exception queue"
    - check: "Timeliness"
      metric: "Data age at checkpoint"
      threshold: "<4 hours"
      action_if_fail: "Alert operations"
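Thresholds written this way can be evaluated directly. The following sketch assumes threshold strings follow the pattern used above (an optional comparison operator, a number, then a unit that is ignored) and that a bare number such as "100%" means "at least that value". The `passes` helper and the measured values are hypothetical.

```python
import operator
import re

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def passes(threshold: str, measured: float) -> bool:
    """Evaluate a measured value against a string such as '>99%' or '<4 hours'."""
    match = re.match(r"\s*(>=|<=|>|<)?\s*(\d+(?:\.\d+)?)", threshold)
    if match is None:
        raise ValueError(f"Unrecognized threshold: {threshold!r}")
    compare = _OPS[match.group(1) or ">="]  # bare number: treat as a minimum
    return compare(measured, float(match.group(2)))


checks = [
    {"check": "Completeness", "threshold": ">99%",
     "action_if_fail": "Alert + manual review"},
    {"check": "Referential integrity", "threshold": "100%",
     "action_if_fail": "Reject record + log"},
    {"check": "Timeliness", "threshold": "<4 hours",
     "action_if_fail": "Alert operations"},
]
# Made-up measurements, in the same units as the thresholds.
measured = {"Completeness": 99.6, "Referential integrity": 99.9, "Timeliness": 2.5}

actions = [
    c["action_if_fail"]
    for c in checks
    if not passes(c["threshold"], measured[c["check"]])
]
print(actions)  # ['Reject record + log']
```

Because units are ignored, the caller is responsible for supplying measurements in the unit the threshold was written in.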
Critical Path Analysis
Criticality Classification
| Level | Definition | Example |
|---|---|---|
| CRITICAL | System cannot function without | Core transaction data |
| IMPORTANT | Significant degradation without | Enrichment data |
| ENRICHMENT | Nice to have, operates without | Supplemental analytics |
Failure Mode Documentation
For each critical source:
- What fails: Specific failure scenario
- How detected: Monitoring/alerting
- Response: Immediate action
- Degradation: How long you can operate in a degraded state
- Fallback: Alternative data source or process
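A documented degradation limit is only useful if something compares it with the clock during an incident. As a small sketch, assuming the limit is stored as a duration and the outage start time comes from monitoring, a check might look like this (`degradation_status` is a hypothetical helper):

```python
from datetime import datetime, timedelta


def degradation_status(
    outage_start: datetime, now: datetime, maximum_degradation: timedelta
) -> str:
    """Compare elapsed degraded time against the documented limit."""
    elapsed = now - outage_start
    if elapsed <= maximum_degradation:
        remaining = maximum_degradation - elapsed
        return f"DEGRADED: {remaining} remaining before escalation"
    return f"LIMIT EXCEEDED by {elapsed - maximum_degradation}: invoke fallback"


start = datetime(2025, 1, 6, 9, 0)
print(degradation_status(start, datetime(2025, 1, 6, 11, 30), timedelta(hours=4)))
# DEGRADED: 1:30:00 remaining before escalation
```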
Compliance Mapping
Data Residency
data_residency_map:
  - data_category: "Canadian client PII"
    regulation: "PIPEDA"
    requirement: "Stored and processed in Canada"
    current_state: "Canadian data center"
    evidence: "Infrastructure topology doc"
    risk_if_violated: "Regulatory action, client trust"
  - data_category: "EU client data"
    regulation: "GDPR"
    requirement: "EU residency or adequate protection"
    current_state: "EU data center + SCCs for US access"
    evidence: "DPA, SCC documentation"
Retention Requirements
| Data Element | Retention | Regulation | Notes |
|---|---|---|---|
| Transaction records | 7 years | BSA/AML | From transaction date |
| Client communications | 6 years | FINRA 4511 | From creation |
| Model decisions | 7 years | Internal policy | Supports examination |
| Raw source data | Per source | Varies | May differ from derived |
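Retention periods expressed in years translate into an earliest purge date per record. The sketch below is illustrative only: `retention_end` is a hypothetical helper, and rolling 29 February forward to 1 March is an assumption chosen so that a record is never purged early.

```python
from datetime import date


def retention_end(start: date, years: int) -> date:
    """Earliest date a record may be purged, `years` after its start date."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # Start was 29 February and the target year is not a leap year:
        # roll forward one day rather than back.
        return date(start.year + years, 3, 1)


print(retention_end(date(2019, 6, 15), 6))  # 2025-06-15
print(retention_end(date(2020, 2, 29), 7))  # 2027-03-01
```

Which date counts as the start (transaction date, creation date, account closure) comes from the Notes column and has to be decided per data element.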
Data Subject Rights
| Right | Process | SLA | Constraints |
|---|---|---|---|
| Access | Export from curated zone | 30 days | Format per request |
| Rectification | Update in source system | 30 days | Audit trail maintained |
| Erasure | Soft delete, then hard | 30 days | Subject to legal holds |
| Portability | Machine-readable export | 30 days | Standard format |
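The erasure row combines two rules: a response deadline and a legal-hold constraint. A minimal sketch of that decision, assuming a 30-day SLA counted in calendar days and a set of subject IDs currently under legal hold (`handle_erasure_request` and the IDs are hypothetical):

```python
from datetime import date, timedelta


def handle_erasure_request(
    subject_id: str, received: date, legal_holds: set[str], sla_days: int = 30
) -> dict:
    """Decide the outcome of an erasure request and compute its deadline."""
    on_hold = subject_id in legal_holds
    return {
        "subject_id": subject_id,
        "respond_by": received + timedelta(days=sla_days),
        "action": "defer: legal hold" if on_hold else "soft delete, then hard delete",
    }


holds = {"CL-204"}  # made-up subject under legal hold
decision = handle_erasure_request("CL-204", date(2025, 3, 1), holds)
print(decision["action"], decision["respond_by"])  # defer: legal hold 2025-03-31
```

A deferred request still has a response deadline: the data subject is told why erasure cannot proceed yet.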
Vendor Dependency Analysis
Concentration Risk
Document single points of failure:
concentration_analysis:
  - capability: "Sanctions screening"
    primary: "Dow Jones"
    backup: "LexisNexis"
    switching_cost: "MEDIUM - 2-4 weeks integration"
    risk_accepted: "Yes - tested annually"
  - capability: "Market data"
    primary: "Bloomberg"
    backup: "Refinitiv (partial coverage)"
    switching_cost: "HIGH - 3-6 months"
    risk_accepted: "Yes - market standard concentration"
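Entries like these can be screened for the capabilities that deserve the most attention. The rule used here (no backup vendor, or a switching cost that starts with "HIGH") is an assumption made for illustration, not a standard definition, and the third entry is hypothetical.

```python
concentration_analysis = [
    {"capability": "Sanctions screening", "primary": "Dow Jones",
     "backup": "LexisNexis", "switching_cost": "MEDIUM - 2-4 weeks integration"},
    {"capability": "Market data", "primary": "Bloomberg",
     "backup": "Refinitiv (partial coverage)", "switching_cost": "HIGH - 3-6 months"},
    # Hypothetical entry with no backup vendor.
    {"capability": "Credit bureau data", "primary": "ExampleBureau",
     "backup": None, "switching_cost": "MEDIUM - 4 weeks"},
]


def high_concentration(entries: list[dict]) -> list[str]:
    """Capabilities with no backup vendor or a HIGH switching cost."""
    return [
        e["capability"]
        for e in entries
        if not e.get("backup") or e["switching_cost"].upper().startswith("HIGH")
    ]


print(high_concentration(concentration_analysis))
# ['Market data', 'Credit bureau data']
```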
Vendor Contingency Testing
| Vendor | Last Test | Test Type | Result |
|---|---|---|---|
| DJ Risk | 2025-Q3 | Failover to backup | Success |
| Bureau | 2025-Q2 | Manual fallback | Success (4hr SLA) |
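Test dates recorded by quarter can be checked for staleness automatically. This sketch assumes labels in the `YYYY-Qn` form used in the table and an annual testing cadence (365 days, measured from the last day of the quarter); `quarter_end` and `is_test_stale` are hypothetical helpers.

```python
from datetime import date


def quarter_end(label: str) -> date:
    """Convert a label such as '2025-Q3' to the last day of that quarter."""
    year_text, quarter_text = label.split("-Q")
    year, quarter = int(year_text), int(quarter_text)
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"Invalid quarter in {label!r}")
    month = quarter * 3
    day = 31 if month in (3, 12) else 30  # Mar/Dec have 31 days, Jun/Sep have 30
    return date(year, month, day)


def is_test_stale(last_test: str, today: date, max_age_days: int = 365) -> bool:
    """True when the last contingency test is older than the allowed age."""
    return (today - quarter_end(last_test)).days > max_age_days


print(is_test_stale("2025-Q3", date(2026, 6, 1)))   # False
print(is_test_stale("2025-Q2", date(2026, 7, 15)))  # True
```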
Common Mistakes
| Mistake | Why It's Wrong | Do This Instead |
|---|---|---|
| Only major systems | Misses intermediate data stores | Document every touchpoint |
| No ownership | Unclear accountability | Three-level ownership model |
| "Transformations applied" | Not traceable | Document specific logic |
| Generic quality checks | Not actionable | Specific thresholds and actions |
| No criticality assessment | Can't prioritize resilience | Classify every source |
| Vendor risk ignored | Concentration risk is real | Document and test contingencies |
Red Flags in Your Lineage
If your lineage documentation has these, it's incomplete:
- Sources listed without ownership
- "Data is transformed" without specific logic
- No quality checkpoints between stages
- Missing external vendor analysis
- No failure scenarios documented
- Retention requirements not mapped
- No evidence of contingency testing
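Several of these red flags are structural and can be caught before a human review. The sketch below assumes the lineage document has been loaded into a dictionary, with `failure_scenarios` nested under `critical_path_analysis`, `retention_requirements` under `compliance_mapping`, and `vendor_risk_assessment` under `vendor_dependency_analysis`. That nesting is an assumption about the template, and `red_flags` is a hypothetical helper covering only a subset of the list.

```python
def red_flags(doc: dict) -> list[str]:
    """Return human-readable findings for a loaded lineage document."""
    flags = []
    critical_path = doc.get("critical_path_analysis") or {}
    compliance = doc.get("compliance_mapping") or {}
    vendors = doc.get("vendor_dependency_analysis") or {}

    if not critical_path.get("failure_scenarios"):
        flags.append("No failure scenarios documented")
    if not compliance.get("retention_requirements"):
        flags.append("Retention requirements not mapped")

    assessments = vendors.get("vendor_risk_assessment") or []
    if not assessments:
        flags.append("Missing external vendor analysis")
    for assessment in assessments:
        if not assessment.get("contingency_tested"):
            flags.append(
                f"No evidence of contingency testing for {assessment.get('vendor')}"
            )
    return flags


# Made-up draft document with two gaps.
draft = {
    "critical_path_analysis": {"failure_scenarios": []},
    "compliance_mapping": {
        "retention_requirements": [{"data_element": "Transaction records"}]
    },
    "vendor_dependency_analysis": {
        "vendor_risk_assessment": [{"vendor": "ExampleVendor"}]
    },
}
for flag in red_flags(draft):
    print(flag)
```

Checks such as "transformation logic is specific enough" still need a reviewer; a script can only confirm that the fields are present.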
Financial Services Context
Data lineage for financial services AI requires:
Regulatory Traceability
- Regulators ask "where does this data come from?"
- Must trace any decision back to source
- Especially important for AML, KYC, suitability
Multi-Jurisdiction Awareness
- Data residency varies by client location
- Cross-border flows need documentation
- Privacy rights vary (GDPR, PIPEDA, CCPA)
Vendor Risk Management
- External data is critical for many AI capabilities
- Concentration risk must be documented
- Contingency plans must be tested
Audit Readiness
- Auditors will trace sample decisions
- Documentation must enable that trace
- Quality checkpoint evidence must exist