qsv Performance Guide
Three Accelerators
1. Index Files (.csv.idx)
Created by: qsv index
Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇
| Benefit | Without Index | With Index |
|---|---|---|
| Row count | Scan entire file | Instant (stored in index) |
| Random access | Sequential scan | O(1) lookup |
| Multithreaded | Not possible | Enabled for many commands |
| Slicing | Read from start | Jump to position |
Rule: Always run index first if you'll run 2+ commands on the same file.
Auto-indexing: The MCP server auto-indexes files > 10MB.
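To illustrate why an index turns row counts and random access into cheap operations, here is a minimal Python sketch of a byte-offset index. It is a conceptual illustration only, not the actual `.csv.idx` format, and it assumes no record contains a newline inside a quoted field. The function names are hypothetical.

```python
import io


def build_offset_index(data: bytes) -> list[int]:
    """Record the byte offset at which each line starts (one full scan)."""
    offsets = []
    pos = 0
    for line in io.BytesIO(data):
        offsets.append(pos)
        pos += len(line)
    return offsets


def read_record(data: bytes, offsets: list[int], n: int) -> str:
    """Jump straight to data record n (0-based, header excluded)."""
    buf = io.BytesIO(data)
    buf.seek(offsets[n + 1])  # +1 skips the header line
    return buf.readline().decode("utf-8").rstrip("\r\n")


csv_bytes = b"id,name\n1,alpha\n2,beta\n3,gamma\n"
index = build_offset_index(csv_bytes)

# The row count falls out of the index without rescanning the data.
row_count = len(index) - 1
```

Once the offsets exist, every later command can seek directly to a record instead of reading from the start, which is also what makes it possible to hand different row ranges to different threads.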
2. Stats Cache (.stats.csv + .stats.csv.data.jsonl)
Created by: qsv stats --cardinality --stats-jsonl
Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)
| Smart Command | What It Uses from Cache |
|---|---|
| frequency | Cardinality to skip all-unique columns |
| schema | Data types for JSON Schema generation |
| sqlp | Column types for Polars optimization |
| joinp | Cardinality for optimal join order |
| pivotp | Cardinality to estimate output width |
| diff | Column types for comparison |
Rule: Run stats --cardinality --stats-jsonl before using any smart command.
Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.
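The two rules so far (index before running 2+ commands, stats cache before any smart command) can be sketched as a small planning helper. This is a hypothetical illustration, not part of qsv or the MCP server; it only builds argument lists and does not run anything. The commands and flags are the ones named in this guide.

```python
# Smart commands as listed in this guide.
SMART_COMMANDS = {
    "frequency", "schema", "tojsonl", "sqlp",
    "joinp", "pivotp", "diff", "sample",
}


def plan_prep_steps(path: str, commands: list[str]) -> list[list[str]]:
    """Return the preparation commands to run before `commands` on `path`."""
    steps = []
    # Rule: index first when 2+ commands will touch the same file.
    if len(commands) >= 2:
        steps.append(["qsv", "index", path])
    # Rule: build the stats cache before any smart command.
    if any(cmd in SMART_COMMANDS for cmd in commands):
        steps.append(["qsv", "stats", "--cardinality", "--stats-jsonl", path])
    return steps
```

For example, `plan_prep_steps("data.csv", ["frequency", "schema"])` yields an index step followed by a stats step, while a single `count` needs no preparation.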
3. Polars Engine
Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)
| Benefit | Standard (csv crate) | Polars Engine |
|---|---|---|
| Processing model | Row-by-row streaming | Vectorized columnar |
| Memory | Streaming (constant) | Columnar (efficient) |
| Parallelism | Single-threaded | Multi-threaded |
| Large files | Limited by memory | Larger-than-memory |
| SQL support | N/A | Full SQL dialect |
Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.
Parquet Acceleration
For CSV files larger than 10MB that need SQL queries, convert to Parquet first with qsv_to_parquet. Parquet is a columnar format that dramatically speeds up SQL queries in sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; sqlp with SKIP_INPUT mode also works. Note: Parquet works ONLY with sqlp and DuckDB -- all other qsv commands require CSV/TSV/SSV input.
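As a rough sketch of what a query against a Parquet file looks like, the helper below builds a SQL string that uses `read_parquet(...)` as the table source. The function names, the example column names, and the exact argument layout of the `sqlp` call (`SKIP_INPUT` in place of an input file) are assumptions for illustration; check `qsv sqlp --help` before relying on them.

```python
from typing import Optional


def parquet_source(path: str) -> str:
    """Table-source expression for a Parquet file, with quotes escaped."""
    escaped = path.replace("'", "''")
    return f"read_parquet('{escaped}')"


def build_query(path: str, columns: str = "*",
                where: Optional[str] = None,
                limit: Optional[int] = None) -> str:
    """Assemble a SELECT statement over the Parquet file at `path`."""
    sql = f"SELECT {columns} FROM {parquet_source(path)}"
    if where:
        sql += f" WHERE {where}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def sqlp_argv(sql: str) -> list:
    """Assumed argv for running the query through sqlp in SKIP_INPUT mode."""
    return ["qsv", "sqlp", "SKIP_INPUT", sql]
```

Selecting only the columns you need and adding a `LIMIT` keeps the benefit of the columnar layout, since the engine does not have to materialize the whole table.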
Memory-Aware Command Selection
Commands That Load Entire File into Memory (🤯)
dedup, reverse, sort, stats (with extended stats), table, transpose
Commands with Memory Proportional to Cardinality (😣)
frequency, join, schema, tojsonl
Streaming Commands (constant memory)
Everything else - select, search, slice, apply, count, etc.
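The three memory classes above can be captured in a small lookup, sketched here in Python. The function and the class labels are hypothetical; the command lists come straight from this section, and what counts as "extended stats" is left to the caller as a boolean.

```python
# Commands that load the entire file into memory.
FULL_LOAD = {"dedup", "reverse", "sort", "table", "transpose"}
# Commands whose memory use grows with column cardinality.
CARDINALITY_BOUND = {"frequency", "join", "schema", "tojsonl"}


def memory_class(command: str, extended_stats: bool = False) -> str:
    """Classify a qsv command as 'full-load', 'cardinality', or 'streaming'."""
    # stats only loads the whole file when extended stats are requested.
    if command in FULL_LOAD or (command == "stats" and extended_stats):
        return "full-load"
    if command in CARDINALITY_BOUND:
        return "cardinality"
    return "streaming"
```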
Large File Decision Tree
File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│ ├── Always: index first
│ ├── SQL queries: convert to Parquet first with qsv_to_parquet
│ ├── Prefer: streaming commands
│ └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│ ├── Always: index + stats cache first
│ ├── SQL queries: convert to Parquet first with qsv_to_parquet
│ ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│ ├── Avoid: sort, reverse, table (load entire file)
│ └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
  ├── Must: index + stats cache
  ├── SQL queries: convert to Parquet first with qsv_to_parquet
  ├── Must: Polars commands only for joins/queries
  ├── Avoid: all 🤯 commands
  └── Consider: split into chunks, process, cat rows
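The decision tree can also be read as a function of file size. The sketch below is a hypothetical helper, not part of qsv; it assumes binary units (1MB = 1024 * 1024 bytes) and treats each lower bound as inclusive, neither of which the tree specifies.

```python
MB = 1024 * 1024
GB = 1024 * MB


def recommend(size_bytes: int) -> dict:
    """Map a file size to the tier and advice from the decision tree."""
    if size_bytes < 10 * MB:
        return {"tier": "< 10MB", "prepare": [], "avoid": []}
    if size_bytes < 100 * MB:
        return {
            "tier": "10MB - 100MB",
            "prepare": ["index"],
            "avoid": [],  # memory-intensive commands are OK if the file fits in RAM
        }
    if size_bytes < 1 * GB:
        return {
            "tier": "100MB - 1GB",
            "prepare": ["index", "stats cache"],
            "avoid": ["sort", "reverse", "table"],
        }
    return {
        "tier": "> 1GB",
        "prepare": ["index", "stats cache"],
        # All full-load commands from the memory section.
        "avoid": ["dedup", "reverse", "sort", "stats (extended)", "table", "transpose"],
    }
```

In every tier from 10MB up, SQL workloads additionally call for converting to Parquet first, so that advice is not repeated per tier here.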
Performance Tips
| Tip | Why |
|---|---|
| Use --output file.csv | Avoids stdout buffering overhead |
| Use count before stats | Fast row count for progress bars |
| Use select early in pipeline | Reduce columns = faster processing |
| Use --no-headers only when needed | Header detection is cheap |
| Use slice --len N for previews | Don't read entire file to inspect |
| Prefer joinp over join | Polars engine is significantly faster |
| Use frequency --limit N | Don't compute all unique values |
| Use stats --cardinality | Enables smart optimizations downstream |
Concurrent Operations
The MCP server limits concurrent qsv operations (default: 1). Pipeline steps run sequentially. For multiple independent files, the agent can issue separate tool calls.
Timeout Handling
- Default timeout: 10 minutes (QSV_MCP_OPERATION_TIMEOUT_MS)
- Long operations (sort on huge files) may time out
- If a timeout occurs: try a Polars alternative or split the file
- Exit code 124 indicates a timeout