Pooch - Data File Fetching
Quick Reference
python
import pooch
# Download single file
file_path = pooch.retrieve(
url="https://example.com/data.csv",
known_hash="sha256:abc123...", # None to skip verification
fname="data.csv",
path=pooch.os_cache("myproject")
)
# Create registry for multiple files
REGISTRY = pooch.create(
path=pooch.os_cache("myproject"),
base_url="https://example.com/data/",
registry={"data.csv": "sha256:abc123...", "model.nc": "sha256:def456..."}
)
data_file = REGISTRY.fetch("data.csv")
# Generate hash for local file
file_hash = pooch.file_hash("/path/to/file.csv")
Key Functions
| Function | Purpose |
|---|---|
pooch.retrieve() | Download single file with caching |
pooch.create() | Create custom data registry |
pooch.file_hash() | Generate SHA256/MD5 hash of file |
pooch.os_cache() | Get OS-specific cache directory |
Essential Operations
Download Files
python
# With hash verification
file_path = pooch.retrieve(
url="https://example.com/data.nc",
known_hash="sha256:abc123..."
)
# Without verification (development only)
file_path = pooch.retrieve(url="https://example.com/data.nc", known_hash=None)
# From Zenodo DOI
file_path = pooch.retrieve(
url="doi:10.5281/zenodo.1234567/data.zip",
known_hash="sha256:abc123..."
)
Extract Archives
python
# ZIP archive
files = pooch.retrieve(
url="https://example.com/data.zip",
known_hash="sha256:abc123...",
processor=pooch.Unzip()
)
# Decompress single gzip file
file_path = pooch.retrieve(
url="https://example.com/data.csv.gz",
known_hash="sha256:abc123...",
processor=pooch.Decompress(name="data.csv")
)
Additional Options
python
# Progress bar for large downloads
file_path = pooch.retrieve(url=url, known_hash=hash, progressbar=True)
# HTTP authentication
file_path = pooch.retrieve(
url="https://example.com/protected/data.csv",
known_hash=None,
downloader=pooch.HTTPDownloader(auth=("user", "pass"))
)
Processor Options
| Processor | Purpose |
|---|---|
Unzip() | Extract ZIP archives |
Untar() | Extract TAR/TAR.GZ archives |
Decompress() | Decompress gzip, bz2, lzma, xz |
Cache Locations
| OS | Default Path |
|---|---|
| Linux | ~/.cache/<project> |
| macOS | ~/Library/Caches/<project> |
| Windows | C:\Users\<user>\AppData\Local\<project>\Cache |
Error Handling
python
try:
file_path = pooch.retrieve(url=url, known_hash=hash)
except pooch.exceptions.HTTPDownloadError:
print("Download failed - check URL")
except pooch.exceptions.DownloadError:
print("Network issue")
References
- •Registry Configuration - Create and manage file registries
Scripts
- •scripts/create_registry.py - Generate registry from local files