data-engineering-storage-remote-access

Cloud storage access in Python: the fsspec, pyarrow.fs, and obstore libraries, plus integrations with Polars, DuckDB, PyArrow, Delta Lake, and Iceberg.

SKILL.md
---
name: data-engineering-storage-remote-access
description: "Cloud storage access in Python: fsspec, pyarrow.fs, obstore libraries, plus integrations with Polars, DuckDB, PyArrow, Delta Lake, and Iceberg."
dependsOn: ["@data-engineering-core", "@data-engineering-storage-authentication", "@data-engineering-storage-formats"]
---

Remote Storage Access

Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.

Quick Comparison

| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best For | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | Up to 9x faster for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async Support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame Integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |

When to Use Which?

Use fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)
  • You work with multiple storage backends (S3, GCS, Azure, HTTP)
  • You need protocol chaining and caching features
  • Your workflow involves diverse data formats beyond Parquet
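
As a minimal sketch of the protocol chaining and caching mentioned above, the snippet below puts fsspec's `simplecache` layer in front of another filesystem. To stay runnable without cloud credentials it uses fsspec's built-in `memory://` filesystem as a stand-in for `s3://`; the path `demo/events.csv` and the temporary cache directory are illustrative assumptions, not part of this skill.

```python
import tempfile

import fsspec

# Seed fsspec's in-memory filesystem so the example needs no cloud account.
# In real use the target would be s3://, gs://, abfs://, etc.
mem = fsspec.filesystem("memory")
mem.pipe("/demo/events.csv", b"id,value\n1,10\n2,20\n")

# Hypothetical local directory that holds the cached copies.
cache_dir = tempfile.mkdtemp(prefix="fsspec-cache-")

# "simplecache::" is chained in front of the target protocol. Keyword
# arguments are routed to each layer by its protocol name.
url = "simplecache::memory://demo/events.csv"
with fsspec.open(url, "rb", simplecache={"cache_storage": cache_dir}) as f:
    payload = f.read()

print(payload.decode().splitlines()[0])
```

With a real backend, backend options would be passed the same way, for example `s3={"anon": True}` next to the `simplecache={...}` argument.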

Use pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native
  • You need zero-copy integration with PyArrow datasets
  • Predicate pushdown and column pruning are critical
  • You work with partitioned Parquet datasets

Use obstore when:

  • Performance is paramount (many small files, high concurrency)
  • You need async/await support for concurrent operations
  • You want minimal dependencies (Rust-based)
  • You run large-scale data ingestion or export jobs

Skill Dependencies

Prerequisites:

  • @data-engineering-core - Polars, DuckDB, PyArrow basics
  • @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns
  • @data-engineering-storage-formats - Parquet, Arrow, Lance, Zarr, Avro, ORC

Related:

  • @data-engineering-storage-lakehouse - Delta Lake, Iceberg on cloud storage
  • @data-engineering-orchestration - dbt with cloud storage

Detailed Guides

Library Deep Dives

  • @data-engineering-storage-remote-access-libraries-fsspec - Universal filesystem interface
  • @data-engineering-storage-remote-access-libraries-pyarrow-fs - Native Arrow integration
  • @data-engineering-storage-remote-access-libraries-obstore - High-performance Rust

DataFrame Integrations

  • @data-engineering-storage-remote-access-integrations-polars - Polars + cloud URIs
  • @data-engineering-storage-remote-access-integrations-duckdb - DuckDB HTTPFS extension
  • @data-engineering-storage-remote-access-integrations-pandas - Pandas + remote files
  • @data-engineering-storage-remote-access-integrations-pyarrow - PyArrow datasets
  • @data-engineering-storage-remote-access-integrations-delta-lake - Delta on S3/GCS/Azure
  • @data-engineering-storage-remote-access-integrations-iceberg - Iceberg with cloud catalogs

Infrastructure Patterns

  • @data-engineering-storage-authentication - AWS, GCP, Azure auth patterns, IAM roles, service principals
  • See performance.md in this skill - Caching, concurrency, async
  • See patterns.md in this skill - Incremental loading, partitioned writes, cross-cloud copy

Storage Formats

  • @data-engineering-storage-formats - Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC

Quick Start Example

python
import fsspec
import obstore as obs
import polars as pl
import pyarrow.fs as fs
import pyarrow.parquet as pq

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table("bucket/data.parquet", filesystem=s3_pa)

# Method 3: obstore (high-performance)
from obstore.store import S3Store
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()

# All approaches work - choose based on your performance and ecosystem needs

Authentication

All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.

See: @data-engineering-storage-authentication
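
The precedence chain described in this section can be sketched in plain Python. This is a conceptual illustration only, not how fsspec, pyarrow.fs, or obstore implement credential lookup: the function `resolve_aws_credentials` and its return labels are hypothetical, while the environment variable names and the INI layout follow the usual AWS conventions.

```python
import configparser
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_aws_credentials(
    explicit: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
    profile: str = "default",
) -> Tuple[str, Optional[dict]]:
    """Return (source, credentials) using the first source that matches."""
    env = os.environ if env is None else env

    # 1. Explicit credentials passed by the caller win.
    if explicit:
        return "explicit", dict(explicit)

    # 2. Environment variables.
    if "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env:
        return "environment", {
            "key": env["AWS_ACCESS_KEY_ID"],
            "secret": env["AWS_SECRET_ACCESS_KEY"],
        }

    # 3. Shared config file (INI format, one section per profile).
    if config_path is not None and config_path.is_file():
        parser = configparser.ConfigParser()
        parser.read(config_path)
        if parser.has_section(profile):
            section = parser[profile]
            if "aws_access_key_id" in section and "aws_secret_access_key" in section:
                return "config-file", {
                    "key": section["aws_access_key_id"],
                    "secret": section["aws_secret_access_key"],
                }

    # 4. Nothing found: defer to the platform (IAM role / managed identity).
    return "instance-role", None


source, _ = resolve_aws_credentials(env={"AWS_ACCESS_KEY_ID": "a", "AWS_SECRET_ACCESS_KEY": "b"})
print(source)
```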

Performance Optimization

Key strategies:

  • Caching: fsspec's SimpleCache for repeated access
  • Concurrency: obstore async API for many small files
  • Predicate pushdown: Filter on partition columns so files that cannot match are never read
  • Column pruning: Read only required columns

See: @data-engineering-storage-remote-access/performance.md

