CSV Wrangling with qsv

Standard Workflow Order

Always follow this sequence when processing CSV data:

•Setup (Cowork) - qsv_get_working_dir (check current dir) -> qsv_set_working_dir (sync to workspace root if needed)
•Discover - sniff (detect format, encoding, delimiter) -> headers -> count
•Index - index (enables fast random access for subsequent commands)
•Profile - stats --cardinality --stats-jsonl (creates cache used by smart commands)
•Inspect - slice --len 5 (preview rows), frequency (value distributions)
•Transform - select, sort, dedup, apply, rename, search, etc.
•Validate - validate (against JSON Schema), stats (verify results)
•Export - to (Parquet, XLSX, etc.), tojsonl, table

Task	Best Tool	Alternative	When to Use Alternative
Select columns	`select`	`sqlp`	Need computed columns
Filter rows	`search`	`sqlp`	Complex WHERE conditions
Sort data	`sort`	`sqlp`	Need ORDER BY with LIMIT
Remove duplicates	`dedup`	`sqlp`	Need GROUP BY dedup
Join two files	`joinp`	`join`	`join` for memory-constrained
Aggregate/GROUP BY	`sqlp`	`frequency`	`frequency` for simple counts
Column stats	`stats`	`moarstats`	`moarstats` for extended stats
Find/replace	`apply operations`	`sqlp`	`sqlp` for conditional replace
Reshape wide->long	`melt`	`sqlp`	Complex reshaping
Reshape long->wide	`pivotp`	`sqlp`	Complex pivots
Concatenate files	`cat rows`	`cat rowskey`	Different column orders
Sample rows	`sample`	`slice`	`slice` for positional ranges

Used by select, search, sort, dedup, frequency, and other commands:

code

sniff -> index -> safenames -> fixlengths -> trim (apply operations) -> dedup -> validate

code

sniff -> index -> stats --cardinality --stats-jsonl -> frequency -> sqlp (GROUP BY queries)

For CSV > 10MB, convert to Parquet before SQL queries: sniff -> index -> stats -> to_parquet -> sqlp (using read_parquet())

code

index (both files) -> stats (both) -> joinp -> select (keep needed columns) -> sort

code

excel (to CSV) -> index -> stats -> select -> to parquet/xlsx

•Column indices are 1-based, not 0-based
•--no-headers flag changes behavior significantly - most commands assume headers exist
•Output goes to stdout by default; use --output file.csv to write to file
•Many commands auto-detect .sz (Snappy compressed) files transparently
•cat rows requires same column order; use cat rowskey for different schemas
•dedup loads all data into memory and sorts internally; use --sorted flag if input is already sorted to enable streaming mode with constant memory
•sort loads entire file into memory; for huge files use sqlp with ORDER BY
•For CSV > 10MB needing SQL queries, convert to Parquet first with qsv_to_parquet for dramatically faster SQL. Parquet works ONLY with sqlp and DuckDB -- all other qsv commands need CSV/TSV/SSV input