qsv Performance Guide

Three Accelerators

1. Index Files (.csv.idx)

Created by: qsv index

Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇

| Benefit | Without Index | With Index |
|---|---|---|
| Row count | Scan entire file | Instant (stored in index) |
| Random access | Sequential scan | O(1) lookup |
| Multithreaded | Not possible | Enabled for many commands |
| Slicing | Read from start | Jump to position |

Rule: Always run index first if you'll run 2+ commands on the same file.

Auto-indexing: The MCP server auto-indexes files > 10MB.
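
As a rough illustration of the rule above, the following Python sketch builds the command lines for indexing a file once and then running several commands against it. The helper name `with_index_first` is hypothetical, and the sketch assumes `qsv` is on the PATH. It only builds argument lists and leaves running them to the caller.

```python
from typing import List


def with_index_first(csv_path: str, commands: List[List[str]]) -> List[List[str]]:
    """Return argv lists, with `qsv index` first when 2+ commands share a file.

    Hypothetical helper: each entry in `commands` is a qsv subcommand plus its
    options, without the input path (for example ["slice", "--len", "10"]).
    """
    argvs: List[List[str]] = []
    if len(commands) >= 2:
        # Build the .csv.idx file once so the later commands can reuse it.
        argvs.append(["qsv", "index", csv_path])
    for cmd in commands:
        argvs.append(["qsv", *cmd, csv_path])
    return argvs


if __name__ == "__main__":
    for argv in with_index_first("data.csv", [["count"], ["slice", "--len", "10"]]):
        print(" ".join(argv))
```

With a single command the index step is skipped, which matches the "2+ commands" threshold in the rule.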

2. Stats Cache (.stats.csv + .stats.csv.data.jsonl)

Created by: qsv stats --cardinality --stats-jsonl

Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)

| Smart Command | What It Uses from Cache |
|---|---|
| frequency | Cardinality to skip all-unique columns |
| schema | Data types for JSON Schema generation |
| sqlp | Column types for Polars optimization |
| joinp | Cardinality for optimal join order |
| pivotp | Cardinality to estimate output width |
| diff | Column types for comparison |

Rule: Run stats --cardinality --stats-jsonl before using any smart command.

Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.
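
One way to apply this rule in a script is to check whether a command is one of the smart commands and, if so, queue the stats call first. The sketch below is only an illustration: `prerequisites` and `stats_cache_argv` are hypothetical helper names, and the set of smart commands is copied from the "Used by" list in this section.

```python
from typing import List

# Smart commands named in this section as consumers of the stats cache.
SMART_COMMANDS = frozenset(
    {"frequency", "schema", "tojsonl", "sqlp", "joinp", "pivotp", "diff", "sample"}
)


def stats_cache_argv(csv_path: str) -> List[str]:
    """Command line that builds the stats cache, using the flags from the rule."""
    return ["qsv", "stats", "--cardinality", "--stats-jsonl", csv_path]


def prerequisites(command: str, csv_path: str, cache_built: bool = False) -> List[List[str]]:
    """Return the commands to run before `command` (hypothetical helper)."""
    if command in SMART_COMMANDS and not cache_built:
        return [stats_cache_argv(csv_path)]
    return []


if __name__ == "__main__":
    print(prerequisites("frequency", "data.csv"))
    print(prerequisites("select", "data.csv"))
```

The `cache_built` flag is left to the caller because this sketch does not inspect the file system for existing cache files.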

3. Polars Engine

Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)

| Benefit | Standard (csv crate) | Polars Engine |
|---|---|---|
| Processing model | Row-by-row streaming | Vectorized columnar |
| Memory | Streaming (constant) | Columnar (efficient) |
| Parallelism | Single-threaded | Multi-threaded |
| Large files | Limited by memory | Larger-than-memory |
| SQL support | N/A | Full SQL dialect |

Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.

Parquet Acceleration

For CSV files larger than 10MB that need SQL queries, convert to Parquet first with qsv_to_parquet. Parquet is a columnar format that dramatically speeds up SQL queries in sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; sqlp with SKIP_INPUT mode also works.

Note: Parquet works ONLY with sqlp and DuckDB. All other qsv commands require CSV/TSV/SSV input.
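
The sqlp route can be sketched as follows. This is a minimal example under stated assumptions: the helper `parquet_query_argv` is hypothetical, and it assumes that SKIP_INPUT is passed in place of the input file so that the query reads directly from `read_parquet(...)`. The Parquet file itself is assumed to have been produced beforehand, for example with qsv_to_parquet.

```python
from typing import List


def parquet_query_argv(parquet_path: str, columns: str = "*", limit: int = 10) -> List[str]:
    """Build a `qsv sqlp` call that queries a Parquet file (hypothetical helper).

    Assumption: SKIP_INPUT takes the place of the CSV input argument.
    """
    if "'" in parquet_path:
        # Keep the sketch simple: refuse paths that would break the SQL literal.
        raise ValueError("single quotes in the path are not supported here")
    sql = f"SELECT {columns} FROM read_parquet('{parquet_path}') LIMIT {int(limit)}"
    return ["qsv", "sqlp", "SKIP_INPUT", sql]


if __name__ == "__main__":
    print(parquet_query_argv("sales.parquet", columns="region, amount", limit=5))
```

Passing the query as a single list element avoids shell quoting issues when the list is handed to `subprocess.run`.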

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

dedup, reverse, sort, stats (with extended stats), table, transpose

Commands with Memory Proportional to Cardinality (😣)

frequency, join, schema, tojsonl

Streaming Commands (constant memory)

Everything else: select, search, slice, apply, count, etc.
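
The three groups above can be expressed as a small lookup, which may be handy when a script has to decide whether a command is safe for a large file. The function name `memory_profile` and the returned labels are hypothetical. The command lists are taken directly from this section.

```python
# Commands that load the entire file into memory (stats only with extended stats).
WHOLE_FILE = frozenset({"dedup", "reverse", "sort", "table", "transpose"})

# Commands whose memory use grows with column cardinality.
CARDINALITY_BOUND = frozenset({"frequency", "join", "schema", "tojsonl"})


def memory_profile(command: str, extended_stats: bool = False) -> str:
    """Classify a qsv command as 'whole-file', 'cardinality', or 'streaming'."""
    if command in WHOLE_FILE or (command == "stats" and extended_stats):
        return "whole-file"
    if command in CARDINALITY_BOUND:
        return "cardinality"
    # Everything else is treated as streaming, as in the list above.
    return "streaming"


if __name__ == "__main__":
    for name in ("sort", "frequency", "select", "stats"):
        print(name, memory_profile(name))
```

Note that `stats` falls into the whole-file group only when `extended_stats=True`, mirroring the "(with extended stats)" qualifier.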

Large File Decision Tree

File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│   ├── Always: index first
│   ├── SQL queries: convert to Parquet first with qsv_to_parquet
│   ├── Prefer: streaming commands
│   └── OK: memory-intensive commands if the file fits in available RAM
├── 100MB - 1GB:
│   ├── Always: index + stats cache first
│   ├── SQL queries: convert to Parquet first with qsv_to_parquet
│   ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│   ├── Avoid: sort, reverse, table (load entire file)
│   └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── SQL queries: convert to Parquet first with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process each chunk, then recombine with cat rows
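
The decision tree can also be written as a function of file size. The sketch below is one possible encoding, not a definitive one: `plan_for_size` is a hypothetical helper, MB is taken to mean 2^20 bytes (the tree does not say), and sizes that fall exactly on a boundary are assigned to the larger tier.

```python
from typing import List

MB = 1024 * 1024  # assumption: the guide does not say whether MB is 10^6 or 2^20 bytes
GB = 1024 * MB

SQL_TIP = "SQL queries: convert to Parquet first with qsv_to_parquet"


def plan_for_size(size_bytes: int) -> List[str]:
    """Map a file size to the advice in the decision tree (hypothetical helper)."""
    if size_bytes < 10 * MB:
        return ["any command works fine"]
    if size_bytes < 100 * MB:
        return [
            "index first",
            SQL_TIP,
            "prefer streaming commands",
            "memory-intensive commands are OK if the file fits in available RAM",
        ]
    if size_bytes < GB:
        return [
            "index + stats cache first",
            SQL_TIP,
            "prefer Polars commands (sqlp, joinp, pivotp)",
            "avoid sort, reverse, table",
            "use sqlp with ORDER BY and LIMIT instead of sort",
        ]
    return [
        "index + stats cache (required)",
        SQL_TIP,
        "Polars commands only for joins and queries",
        "avoid all whole-file commands",
        "consider splitting into chunks, processing, then cat rows",
    ]


if __name__ == "__main__":
    for size in (5 * MB, 50 * MB, 500 * MB, 2 * GB):
        print(size, plan_for_size(size))
```

In practice the size would come from something like `os.path.getsize(path)`.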

Performance Tips

| Tip | Why |
|---|---|
| Use --output file.csv | Avoids stdout buffering overhead |
| Use count before stats | Fast row count for progress bars |
| Use select early in pipeline | Reduce columns = faster processing |
| Use --no-headers only when needed | Header detection is cheap |
| Use slice --len N for previews | Don't read entire file to inspect |
| Prefer joinp over join | Polars engine is significantly faster |
| Use frequency --limit N | Don't compute all unique values |
| Use stats --cardinality | Enables smart optimizations downstream |
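
Two of these tips, selecting columns early and capping frequency output, combine naturally into one pipeline. The following sketch composes such a pipeline as a shell string. The helper `preview_pipeline` is hypothetical, and the sketch assumes that frequency reads from standard input when no input file is given.

```python
import shlex
from typing import List


def preview_pipeline(csv_path: str, columns: List[str], limit: int = 10) -> str:
    """Compose `qsv select ... | qsv frequency --limit N` (hypothetical helper)."""
    # Narrow to the columns of interest first so later stages handle less data.
    select = ["qsv", "select", ",".join(columns), csv_path]
    # Cap the number of values reported per column.
    frequency = ["qsv", "frequency", "--limit", str(limit)]
    return f"{shlex.join(select)} | {shlex.join(frequency)}"


if __name__ == "__main__":
    print(preview_pipeline("data.csv", ["city", "state"], limit=5))
```

`shlex.join` quotes arguments only when needed, so paths containing spaces stay intact when the string is passed to a shell.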

Concurrent Operations

The MCP server limits concurrent qsv operations (default: 1). Pipeline steps run sequentially. For multiple independent files, the agent can issue separate tool calls.

Timeout Handling

  • Default timeout: 10 minutes (QSV_MCP_OPERATION_TIMEOUT_MS)
  • Long operations (sort on huge files) may time out
  • If a timeout occurs: try a Polars alternative or split the file
  • Exit code 124 indicates timeout
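
A client-side wrapper that mirrors this behavior might look like the sketch below. It is not the MCP server's implementation: `run_with_timeout`, `default_timeout_s`, and `advice` are hypothetical helpers, and the only facts taken from this section are the environment variable name, the 10-minute default, and exit code 124.

```python
import os
import subprocess
import sys
from typing import List

TIMEOUT_EXIT_CODE = 124  # exit code this guide associates with a timeout


def default_timeout_s() -> float:
    """Read QSV_MCP_OPERATION_TIMEOUT_MS, falling back to the 10-minute default."""
    return int(os.environ.get("QSV_MCP_OPERATION_TIMEOUT_MS", "600000")) / 1000.0


def run_with_timeout(argv: List[str], timeout_s: float) -> int:
    """Run a command; return its exit code, or 124 if it exceeds timeout_s."""
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return TIMEOUT_EXIT_CODE


def advice(exit_code: int) -> str:
    """Translate an exit code into the follow-up suggested in the list above."""
    if exit_code == TIMEOUT_EXIT_CODE:
        return "timed out: try a Polars alternative or split the file"
    return "ok" if exit_code == 0 else f"failed with exit code {exit_code}"


if __name__ == "__main__":
    # Stand-in for a slow qsv call: a Python child that sleeps for 5 seconds.
    code = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
    print(code, advice(code))
```

For a real call, `argv` would be a qsv command line and `timeout_s` would typically be `default_timeout_s()`.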