qsv Performance Guide

Three Accelerators

1. Index Files (`.csv.idx`)

Created by: qsv index Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇

Benefit	Without Index	With Index
Row count	Scan entire file	Instant (stored in index)
Random access	Sequential scan	O(1) lookup
Multithreaded	Not possible	Enabled for many commands
Slicing	Read from start	Jump to position

Rule: Always run index first if you'll run 2+ commands on the same file.

Auto-indexing: The MCP server auto-indexes files > 10MB.

2. Stats Cache (`.stats.csv` + `.stats.csv.data.jsonl`)

Created by: qsv stats --cardinality --stats-jsonl Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)

Smart Command	What It Uses from Cache
`frequency`	Cardinality to skip all-unique columns
`schema`	Data types for JSON Schema generation
`sqlp`	Column types for Polars optimization
`joinp`	Cardinality for optimal join order
`pivotp`	Cardinality to estimate output width
`diff`	Column types for comparison

Rule: Run stats --cardinality --stats-jsonl before using any smart command.

Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.

3. Polars Engine

Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)

Benefit	Standard (csv crate)	Polars Engine
Processing model	Row-by-row streaming	Vectorized columnar
Memory	Streaming (constant)	Columnar (efficient)
Parallelism	Single-threaded	Multi-threaded
Large files	Limited by memory	Larger-than-memory
SQL support	N/A	Full SQL dialect

Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.

Parquet Acceleration

For CSV > 10MB needing SQL queries, convert to Parquet first with qsv_to_parquet. Parquet is a columnar format that dramatically speeds up SQL queries in sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; sqlp with SKIP_INPUT mode also works. Note: Parquet works ONLY with sqlp and DuckDB -- all other qsv commands require CSV/TSV/SSV input.

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

dedup, reverse, sort, stats (with extended stats), table, transpose

Commands with Memory Proportional to Cardinality (😣)

frequency, join, schema, tojsonl

Streaming Commands (constant memory)

Everything else - select, search, slice, apply, count, etc.

Large File Decision Tree

code

File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│   ├── Always: index first
│   ├── SQL queries: convert to Parquet first with qsv_to_parquet
│   ├── Prefer: streaming commands
│   └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│   ├── Always: index + stats cache first
│   ├── SQL queries: convert to Parquet first with qsv_to_parquet
│   ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│   ├── Avoid: sort, reverse, table (load entire file)
│   └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── SQL queries: convert to Parquet first with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process, cat rows

Performance Tips

Tip	Why
Use `--output file.csv`	Avoids stdout buffering overhead
Use `count` before `stats`	Fast row count for progress bars
Use `select` early in pipeline	Reduce columns = faster processing
Use `--no-headers` only when needed	Header detection is cheap
Use `slice --len N` for previews	Don't read entire file to inspect
Prefer `joinp` over `join`	Polars engine is significantly faster
Use `frequency --limit N`	Don't compute all unique values
Use `stats --cardinality`	Enables smart optimizations downstream

Concurrent Operations

The MCP server limits concurrent qsv operations (default: 1). Pipeline steps run sequentially. For multiple independent files, the agent can issue separate tool calls.

Timeout Handling

•Default timeout: 10 minutes (QSV_MCP_OPERATION_TIMEOUT_MS)
•Long operations (sort on huge files) may timeout
•If timeout occurs: try Polars alternative or split the file
•Exit code 124 indicates timeout