IO Utilities Guide
This skill provides comprehensive guidance for using the IO utilities in speedy_utils.
When to Use This Skill
Use this skill when you need to:
- •Read and write data in various formats (JSON, JSONL, Pickle, CSV, TXT).
- •Efficiently process large JSONL files with streaming and multi-threading.
- •Automatically handle file compression (gzip, bz2, xz, zstd).
- •Load data based on file extension automatically.
- •Serialize Pydantic models and other objects easily.
Prerequisites
- •
speedy_utilsinstalled. - •Optional dependencies for specific features:
- •
orjson: For faster JSON parsing. - •
zstandard: For.zstfile support. - •
pandas: For CSV/TSV loading. - •
pyarrow: For faster CSV reading with pandas.
- •
Core Capabilities
Fast JSONL Processing (fast_load_jsonl)
- •Streams data line-by-line for memory efficiency.
- •Supports automatic decompression.
- •Uses
orjsonif available for speed. - •Supports multi-threaded processing for large files.
- •Shows progress bar with
tqdm.
Universal Loading (load_by_ext)
- •Detects file type by extension.
- •Supports glob patterns (e.g.,
data/*.json) and lists of files. - •Uses parallel processing for multiple files.
- •Supports memoization via
do_memoize=True.
Serialization (dump_json_or_pickle, load_json_or_pickle)
- •Unified interface for JSON and Pickle.
- •Handles Pydantic models automatically.
- •Creates parent directories if they don't exist.
Usage Examples
Example 1: Streaming Large JSONL
Read a large compressed JSONL file line by line.
python
from speedy_utils import fast_load_jsonl
# Iterates lazily, low memory usage
for item in fast_load_jsonl('large_data.jsonl.gz', progress=True):
process(item)
Example 2: Loading Any File
Load a file without worrying about the format.
python
from speedy_utils import load_by_ext
data = load_by_ext('config.json')
df = load_by_ext('data.csv')
items = load_by_ext('dataset.pkl')
Example 3: Parallel Loading
Load multiple files in parallel.
python
from speedy_utils import load_by_ext
# Returns a list of results, one for each file
all_data = load_by_ext('logs/*.jsonl')
Example 4: Dumping Data
Save data to disk, creating directories as needed.
python
from speedy_utils import dump_json_or_pickle
data = {"key": "value"}
dump_json_or_pickle(data, 'output/processed/result.json')
Guidelines
- •
Prefer JSONL for Large Datasets:
- •Use
fast_load_jsonlfor datasets that don't fit in memory. - •It handles compression transparently, so keep files compressed (
.jsonl.gzor.jsonl.zst) to save space.
- •Use
- •
Use
load_by_extfor Scripts:- •When writing scripts that might accept different input formats, use
load_by_extto be flexible.
- •When writing scripts that might accept different input formats, use
- •
Error Handling:
- •
fast_load_jsonlhas anon_errorparameter (raise,warn,skip) to handle malformed lines gracefully.
- •
- •
Performance:
- •Install
orjsonfor significantly faster JSON operations. - •
load_by_extusespyarrowengine for CSVs if available, which is much faster.
- •Install
Limitations
- •Memory Usage:
load_by_extloads the entire file into memory. Usefast_load_jsonlfor streaming. - •Glob Expansion:
load_by_extwith glob patterns loads all matching files into memory at once (in a list). Be careful with massive datasets.