Data Import & Validation (Streaming + Error CSV + Metrics + Idempotency)

Name: data-import-parsers
Rating: 92
Author: janjaszczak

When to use this skill

Use when working on:

•Process input files sequentially (no “load everything then insert”) to control memory.
•Validate and coerce types explicitly (define allowed coercions; reject ambiguous cases).
•Irreparable records must be skipped and logged to an error CSV containing:
  - all original columns exactly as seen in input
  - extra columns: timestamp, file, line, error
•Emit metrics at minimum: rows_ok, rows_skipped, parse_errors.
•DB writes must be idempotent: re-running the import must not duplicate or corrupt data.
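
One possible way to satisfy the idempotency rule is an upsert keyed on the row identity. The sketch below is illustrative only: it assumes SQLite (3.24 or newer, for `ON CONFLICT ... DO UPDATE`), and the `customers` table, its columns, and the `customer_id` key are hypothetical names, not taken from any real schema.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id TEXT PRIMARY KEY, email TEXT, balance REAL)"
    )
    return conn


def upsert_batch(conn: sqlite3.Connection, rows: list) -> None:
    """Insert new rows and overwrite existing ones with the same key.

    Running this twice with the same input leaves the table unchanged,
    which is the property the import needs.
    """
    conn.executemany(
        """
        INSERT INTO customers (customer_id, email, balance)
        VALUES (:customer_id, :email, :balance)
        ON CONFLICT(customer_id) DO UPDATE SET
            email = excluded.email,
            balance = excluded.balance
        """,
        rows,
    )
    conn.commit()
```

Other strategies (for example insert-and-ignore on a unique key, or tracking processed files) may fit better depending on whether re-imports are expected to update existing rows.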

•Identify input formats, volume, and “row identity” rules (keys/dedup strategy).
•Locate current import entrypoints + DB write layer.
•Define:
  - schema mapping (source -> target columns)
  - validation rules per field
  - coercion rules per field (what is allowed, what is not)
  - error taxonomy (what counts as “irreparable”)
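
The definitions above can be written down as data before any import code exists. The following is a minimal sketch, assuming a per-field spec object; `FieldSpec`, `RowError`, the column names, and the age range are all hypothetical and only illustrate the shape.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class RowError(Enum):
    """Hypothetical error taxonomy; every member here counts as irreparable."""
    MISSING_COLUMN = "missing_column"
    BAD_TYPE = "bad_type"
    RULE_VIOLATION = "rule_violation"


_INT_RE = re.compile(r"[+-]?[0-9]+")


def to_int_strict(raw: str) -> int:
    """Allowed: ' 42 ', '-7'. Rejected as ambiguous: '1.0', '1e3', '1_000', ''."""
    text = raw.strip()
    if not _INT_RE.fullmatch(text):
        raise ValueError(f"not a plain integer: {raw!r}")
    return int(text)


def to_nonempty_str(raw: str) -> str:
    """Allowed: any text with at least one non-whitespace character."""
    text = raw.strip()
    if not text:
        raise ValueError("empty value")
    return text


def always_valid(value: Any) -> bool:
    return True


@dataclass(frozen=True)
class FieldSpec:
    source: str                       # column name in the input file
    target: str                       # column name in the database
    coerce: Callable[[str], Any]      # raw string -> typed value, raises ValueError
    validate: Callable[[Any], bool] = always_valid  # typed value -> acceptable?


# Schema mapping plus per-field rules, declared in one place.
SCHEMA = [
    FieldSpec("Customer ID", "customer_id", to_nonempty_str),
    FieldSpec("Age", "age", to_int_strict, validate=lambda age: 0 <= age <= 150),
]
```

Note that Python's built-in `int()` accepts inputs such as `"1_000"`, which is why the sketch checks the text against an explicit pattern first.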

•Read one file at a time, row by row (or chunked) and write in bounded batches.
•Never accumulate full datasets in memory.
•Ensure progress logging is monotonic (e.g., file + row counters).
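
A minimal sketch of the streaming approach above, using only the standard library. The function names `iter_rows` and `batched` are illustrative; the input is assumed to be a CSV with a header row.

```python
import csv
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_rows(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (line_number, row) pairs one at a time.

    `lines` can be an open text file (opened with newline="") or any
    iterable of strings. For records that span several physical lines,
    line_number is the last line of the record.
    """
    reader = csv.DictReader(lines)
    for row in reader:
        yield reader.line_num, row


def batched(rows: Iterable, size: int) -> Iterator[List]:
    """Group a stream into lists of at most `size` items.

    Only one batch is held in memory at a time.
    """
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

In use, each batch would be handed to the DB write layer and then discarded; the line numbers double as a monotonic progress counter within a file.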

•Treat raw row values as immutable “source of truth”.
•Perform coercions in a controlled layer:
  - return (ok, parsed_record) or (error, reason) per row
•Separate concerns:
  - parsing (raw -> typed)
  - business validation (typed -> acceptable)
  - persistence (acceptable -> DB)
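
The three layers can be kept apart as separate functions that share one result shape. This is a sketch under assumptions: the order columns (`order_id`, `quantity`, `ordered_on`) are hypothetical, and a plain list stands in for the real DB write layer.

```python
from datetime import date
from typing import Dict, List, Tuple, Union

Result = Tuple[str, Union[dict, str]]  # ("ok", record) or ("error", reason)


def parse(raw: Dict[str, str]) -> Result:
    """Parsing: raw strings -> typed values. Reads `raw`, never mutates it."""
    try:
        quantity_text = raw["quantity"].strip()
        if not (quantity_text.isascii() and quantity_text.isdigit()):
            return ("error", f"quantity: not a plain integer: {raw['quantity']!r}")
        typed = {
            "order_id": raw["order_id"].strip(),
            "quantity": int(quantity_text),
            "ordered_on": date.fromisoformat(raw["ordered_on"].strip()),
        }
    except KeyError as exc:
        return ("error", f"missing column: {exc.args[0]}")
    except ValueError as exc:
        return ("error", f"ordered_on: {exc}")
    return ("ok", typed)


def validate(typed: dict) -> Result:
    """Business validation: typed -> acceptable."""
    if not typed["order_id"]:
        return ("error", "order_id: empty")
    if typed["quantity"] < 1:
        return ("error", "quantity: must be at least 1")
    return ("ok", typed)


def persist(record: dict, sink: List[dict]) -> None:
    """Persistence: acceptable -> DB (a list stands in for the database)."""
    sink.append(record)


def process_row(raw: Dict[str, str], sink: List[dict]) -> Result:
    """Run one row through the layers, stopping at the first error."""
    status, payload = parse(raw)
    if status == "ok":
        status, payload = validate(payload)
    if status == "ok":
        persist(payload, sink)
    return (status, payload)
```

Because every layer returns the same tuple shape, the caller can route any `("error", reason)` result to the error CSV without knowing which layer rejected the row.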

•On any irreparable row:
  - write original row values (unmodified)
  - add: timestamp, file, line, error
•Avoid partial writes that drop context; every skipped row must be explainable from the CSV alone.
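
A small writer class is one way to keep the error CSV layout consistent. The sketch below is illustrative; `ErrorCsvWriter` is a hypothetical name, and it assumes each skipped row has the same number of fields as the header (ragged rows would need a separate policy, such as storing the raw line in one column).

```python
import csv
from datetime import datetime, timezone
from typing import Optional, Sequence

EXTRA_COLUMNS = ["timestamp", "file", "line", "error"]


class ErrorCsvWriter:
    """Write skipped rows with their original values plus context columns."""

    def __init__(self, handle, source_columns: Sequence[str]) -> None:
        # `handle` is any writable text stream, e.g. a file opened with newline="".
        self._writer = csv.writer(handle)
        self._writer.writerow(list(source_columns) + EXTRA_COLUMNS)

    def write(
        self,
        raw_values: Sequence[str],
        file: str,
        line: int,
        error: str,
        timestamp: Optional[str] = None,
    ) -> None:
        # Original values are passed through untouched: no stripping, no coercion.
        if timestamp is None:
            timestamp = datetime.now(timezone.utc).isoformat()
        self._writer.writerow(list(raw_values) + [timestamp, file, line, error])
```

The `timestamp` parameter exists only so that tests can pin the value; in normal use it is left out and the current UTC time is recorded.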

Maintain counters:

•rows_ok: successfully persisted rows
•rows_skipped: irreparable rows skipped
•parse_errors: count of parse/validation errors (can equal rows_skipped or be a superset if you track recoverable warnings separately)

Emit metrics:

•end-of-file summary
•end-of-run summary (aggregate over files)

Optionally persist metrics to a JSON file or a DB table for observability.
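
The counters and the two summaries could be sketched as follows. `ImportMetrics` and `summarize` are hypothetical names; the three counter names come from the list above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class ImportMetrics:
    rows_ok: int = 0        # successfully persisted rows
    rows_skipped: int = 0   # irreparable rows skipped
    parse_errors: int = 0   # parse/validation errors, may exceed rows_skipped

    def add(self, other: "ImportMetrics") -> None:
        """Fold another file's counters into this one."""
        self.rows_ok += other.rows_ok
        self.rows_skipped += other.rows_skipped
        self.parse_errors += other.parse_errors


def summarize(per_file: Dict[str, ImportMetrics]) -> dict:
    """Build a JSON-serializable report: one entry per file plus a run total."""
    total = ImportMetrics()
    for metrics in per_file.values():
        total.add(metrics)
    return {
        "files": {name: asdict(metrics) for name, metrics in per_file.items()},
        "total": asdict(total),
    }
```

The returned dictionary can be passed to `json.dumps` for the end-of-run summary, or written row by row to a metrics table.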

Pick and implement one clear strategy:

Idempotency must hold across: