IO Utilities Guide

This skill provides comprehensive guidance for using the IO utilities in speedy_utils.

When to Use This Skill

Use this skill when you need to:

•Read and write data in various formats (JSON, JSONL, Pickle, CSV, TXT).
•Efficiently process large JSONL files with streaming and multi-threading.
•Automatically handle file compression (gzip, bz2, xz, zstd).
•Load data based on file extension automatically.
•Serialize Pydantic models and other objects easily.

Prerequisites

•speedy_utils installed.
•
Optional dependencies for specific features:
- •orjson: For faster JSON parsing.
- •zstandard: For .zst file support.
- •pandas: For CSV/TSV loading.
- •pyarrow: For faster CSV reading with pandas.

Core Capabilities

Fast JSONL Processing (`fast_load_jsonl`)

•Streams data line-by-line for memory efficiency.
•Supports automatic decompression.
•Uses orjson if available for speed.
•Supports multi-threaded processing for large files.
•Shows progress bar with tqdm.

Universal Loading (`load_by_ext`)

•Detects file type by extension.
•Supports glob patterns (e.g., data/*.json) and lists of files.
•Uses parallel processing for multiple files.
•Supports memoization via do_memoize=True.

Serialization (`dump_json_or_pickle`, `load_json_or_pickle`)

•Unified interface for JSON and Pickle.
•Handles Pydantic models automatically.
•Creates parent directories if they don't exist.

Usage Examples

Example 1: Streaming Large JSONL

Read a large compressed JSONL file line by line.

python

from speedy_utils import fast_load_jsonl

# Iterates lazily, low memory usage
for item in fast_load_jsonl('large_data.jsonl.gz', progress=True):
    process(item)

Example 2: Loading Any File

Load a file without worrying about the format.

python

from speedy_utils import load_by_ext

data = load_by_ext('config.json')
df = load_by_ext('data.csv')
items = load_by_ext('dataset.pkl')

Example 3: Parallel Loading

Load multiple files in parallel.

python

from speedy_utils import load_by_ext

# Returns a list of results, one for each file
all_data = load_by_ext('logs/*.jsonl')

Example 4: Dumping Data

Save data to disk, creating directories as needed.

python

from speedy_utils import dump_json_or_pickle

data = {"key": "value"}
dump_json_or_pickle(data, 'output/processed/result.json')

Guidelines

•
Prefer JSONL for Large Datasets:
- •Use fast_load_jsonl for datasets that don't fit in memory.
- •It handles compression transparently, so keep files compressed (.jsonl.gz or .jsonl.zst) to save space.
•
Use load_by_ext for Scripts:
- •When writing scripts that might accept different input formats, use load_by_ext to be flexible.
•
Error Handling:
- •fast_load_jsonl has an on_error parameter (raise, warn, skip) to handle malformed lines gracefully.
•
Performance:
- •Install orjson for significantly faster JSON operations.
- •load_by_ext uses pyarrow engine for CSVs if available, which is much faster.

Limitations

•Memory Usage: load_by_ext loads the entire file into memory. Use fast_load_jsonl for streaming.
•Glob Expansion: load_by_ext with glob patterns loads all matching files into memory at once (in a list). Be careful with massive datasets.

IO Utilities Guide

When to Use This Skill

Prerequisites

Core Capabilities

Fast JSONL Processing (fast_load_jsonl)

Universal Loading (load_by_ext)

Serialization (dump_json_or_pickle, load_json_or_pickle)

Usage Examples

Example 1: Streaming Large JSONL

Example 2: Loading Any File

Example 3: Parallel Loading

Example 4: Dumping Data

Guidelines

Limitations

Fast JSONL Processing (`fast_load_jsonl`)

Universal Loading (`load_by_ext`)

Serialization (`dump_json_or_pickle`, `load_json_or_pickle`)