Journalistic Data Preprocessing
Preprocessing data for journalism requires higher standards than typical data science: every transformation must be traceable, every decision documented, and the human must approve substantive changes.
Core Principles
- •Provenance first: Every row traces to source file, sheet, and row number
- •No silent transformations: Never automatically fix issues without documentation
- •Human-in-the-loop: Present findings and get approval before transformations
- •Transparency artifacts: Generate documentation a reporter could hand to an editor
Workflow Overview
code
1. LOAD → Ingest data, establish provenance columns 2. AUDIT → Systematically examine every column for issues 3. REPORT → Present findings, proposed fixes, questions to user 4. TRANSFORM → After approval, execute documented transformations 5. VALIDATE → Confirm transformations, output final dataset + audit trail
Phase 1: Data Loading
Architectural Questions to Ask User
Before loading, clarify:
- •Structure: One table or multiple related tables?
- •Grain: What does each row represent?
- •Key fields: What uniquely identifies a record?
- •Time coverage: What date range does this cover?
Provenance Column Standard
Always add these columns to loaded data:
python
'_source_file' # Original filename '_source_sheet' # Sheet name (if Excel) or 'csv' '_source_row' # 1-indexed row number in original file '_load_timestamp' # When this record was loaded
Phase 2: Column Audit
Systematically examine every column.
Audit Checklist Per Column
| Category | What to Check |
|---|---|
| Type | Is inferred type correct? Mixed types? |
| Missing | How many nulls? Pattern to missingness? |
| Cardinality | Unique values vs total rows |
| Distribution | Outliers? Impossible values? |
| Text quality | Encoding issues? Entity variations? Typos? |
| Dates | Consistent format? Future or distant past dates? |
| Numeric | Scale consistent? Negative where unexpected? |
Phase 3: Issue Report
Generate a report for human review. See references/report-template.md for format.
Report Structure
code
# Data Quality Report: [Dataset Name] ## Summary - Total rows / columns / columns with issues ## Critical Issues (Require Decision) [Issues that could affect analysis validity] ## Warnings (Review Recommended) [Issues that may or may not need fixing] ## Proposed Transformations [Each transformation with rationale] ## Questions for Human Review [Decisions that require domain knowledge]
Key Report Principles
- •Severity tiers: Critical > Warning > Info
- •Concrete examples: Show actual problematic values
- •Proposed fixes: Suggest specific transformations
- •Preserve optionality: Let human choose between approaches
- •Audit artifacts: Identify what documentation to produce
Phase 4: Transformations
After human approval, execute transformations with full documentation.
Phase 5: Validation & Output
Final Outputs
- •Clean dataset (
cleaned_[name].csv) - Provenance columns preserved - •Transformation log (
transformation_log.csv) - Every change documented - •Audit report (
data_audit_report.md) - Issues, decisions, resolutions - •Entity mappings (
entity_mapping_[column].csv) - If standardization applied
Human Review Artifacts
| Decision Type | Artifact to Generate |
|---|---|
| Entity variations | Frequency table + proposed mapping |
| Outliers | Distribution summary + flagged values |
| Missing data | Missingness by column summary |
| Duplicates | Sample duplicate groups |
References
- •
references/report-template.md- Full report template with examples