AgentSkillsCN

Csv Wrangling

Csv 数据清洗

SKILL.md

CSV Wrangling with qsv

Standard Workflow Order

Always follow this sequence when processing CSV data:

  1. Setup (Cowork) - qsv_get_working_dir (check current dir) -> qsv_set_working_dir (sync to workspace root if needed)
  2. Discover - sniff (detect format, encoding, delimiter) -> headers -> count
  3. Index - index (enables fast random access for subsequent commands)
  4. Profile - stats --cardinality --stats-jsonl (creates cache used by smart commands)
  5. Inspect - slice --len 5 (preview rows), frequency (value distributions)
  6. Transform - select, sort, dedup, apply, rename, search, etc.
  7. Validate - validate (against JSON Schema), stats (verify results)
  8. Export - to (Parquet, XLSX, etc.), tojsonl, table

Tool Selection Matrix

TaskBest ToolAlternativeWhen to Use Alternative
Select columnsselectsqlpNeed computed columns
Filter rowssearchsqlpComplex WHERE conditions
Sort datasortsqlpNeed ORDER BY with LIMIT
Remove duplicatesdedupsqlpNeed GROUP BY dedup
Join two filesjoinpjoinjoin for memory-constrained
Aggregate/GROUP BYsqlpfrequencyfrequency for simple counts
Column statsstatsmoarstatsmoarstats for extended stats
Find/replaceapply operationssqlpsqlp for conditional replace
Reshape wide->longmeltsqlpComplex reshaping
Reshape long->widepivotpsqlpComplex pivots
Concatenate filescat rowscat rowskeyDifferent column orders
Sample rowssamplesliceslice for positional ranges

qsv Selection Syntax

Used by select, search, sort, dedup, frequency, and other commands:

SyntaxMeaningExample
nameColumn by nameselect "City"
1Column by 1-based indexselect 1
1,3,5Multiple columnsselect 1,3,5
1-5Range (inclusive)select 1-5
!colExclude columnselect '!SSN'
!1-3Exclude rangeselect '!1-3'
/regex/Match column namesselect '/^price/'

Common Pipeline Patterns

Clean and Deduplicate

code
sniff -> index -> safenames -> fixlengths -> trim (apply operations) -> dedup -> validate

Profile and Analyze

code
sniff -> index -> stats --cardinality --stats-jsonl -> frequency -> sqlp (GROUP BY queries)

For CSV > 10MB, convert to Parquet before SQL queries: sniff -> index -> stats -> to_parquet -> sqlp (using read_parquet())

Join and Enrich

code
index (both files) -> stats (both) -> joinp -> select (keep needed columns) -> sort

Convert and Export

code
excel (to CSV) -> index -> stats -> select -> to parquet/xlsx

Delimiter Handling

  • CSV (,): default, no flag needed
  • TSV (\t): use --delimiter '\t' or file extension .tsv
  • SSV (;): use --delimiter ';' or file extension .ssv
  • Auto-detect: set QSV_SNIFF_DELIMITER=1 environment variable

Important Notes

  • Column indices are 1-based, not 0-based
  • --no-headers flag changes behavior significantly - most commands assume headers exist
  • Output goes to stdout by default; use --output file.csv to write to file
  • Many commands auto-detect .sz (Snappy compressed) files transparently
  • cat rows requires same column order; use cat rowskey for different schemas
  • dedup loads all data into memory and sorts internally; use --sorted flag if input is already sorted to enable streaming mode with constant memory
  • sort loads entire file into memory; for huge files use sqlp with ORDER BY
  • For CSV > 10MB needing SQL queries, convert to Parquet first with qsv_to_parquet for dramatically faster SQL. Parquet works ONLY with sqlp and DuckDB -- all other qsv commands need CSV/TSV/SSV input