data-engineering

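A comprehensive data engineering skill suite covering core libraries (Polars, DuckDB, PyArrow), lakehouse formats, cloud storage, orchestration, streaming, quality assurance, observability, and AI/ML pipelines.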

SKILL.md
---
name: data-engineering
description: "Comprehensive data engineering skill suite covering core libraries (Polars, DuckDB, PyArrow), lakehouse formats, cloud storage, orchestration, streaming, quality, observability, and AI/ML pipelines."
---

Data Engineering Hub

Welcome to the comprehensive data engineering skill suite. This hub organizes all data engineering knowledge into logical, non-overlapping domains.

Skill Map

| Domain | Skills | When to Use |
| --- | --- | --- |
| Core | @data-engineering-core | Polars, DuckDB, PyArrow fundamentals; ETL patterns; error handling; performance optimization |
| Storage | @data-engineering-storage-lakehouse | Delta Lake, Apache Iceberg, Apache Hudi |
| | @data-engineering-storage-remote-access | fsspec, pyarrow.fs, obstore; cloud access patterns |
| | @data-engineering-storage-authentication | AWS, GCP, Azure auth: IAM roles, managed identity, secrets management |
| | @data-engineering-storage-formats | Parquet optimizations, Lance, Zarr, Avro, ORC |
| Orchestration | @data-engineering-orchestration | Prefect, Dagster, dbt, workflow scheduling |
| Streaming | @data-engineering-streaming | Kafka, MQTT, NATS JetStream for real-time data |
| Quality | @data-engineering-quality | Great Expectations, Pandera for data validation |
| Observability | @data-engineering-observability | OpenTelemetry, Prometheus for pipeline monitoring |
| AI/ML | @data-engineering-ai-ml | Embeddings, vector databases, RAG pipelines |
| Best Practices | @data-engineering-best-practices | Medallion architecture, partitioning, file sizing, incremental loads, schema evolution, testing |
| Catalogs | @data-engineering-catalogs | Data catalog systems: Iceberg catalogs, DuckDB multi-source, Amundsen/DataHub/OpenMetadata |

Quick Reference: Core Stack

| Task | Recommended Tool |
| --- | --- |
| DataFrame operations | Polars (often cited as 10-50x faster than pandas; the gain varies by workload) |
| SQL analytics | DuckDB (embedded OLAP, zero-copy Arrow integration) |
| Data interchange | PyArrow (Arrow format, zero-copy transfers) |
| Cloud storage access | fsspec (universal), pyarrow.fs (Arrow-native), obstore (high-performance) |
| Lakehouse format | Delta Lake (Spark ecosystem), Iceberg (engine-agnostic), Hudi (streaming CDC) |
| Orchestration | Prefect (Pythonic flows), Dagster (asset-based), dbt (SQL transformations) |
| Validation | Pandera (lightweight), Great Expectations (enterprise) |

Getting Started

New to Data Engineering?

Start with @data-engineering-core to learn the foundational libraries and patterns.

Working with Cloud Storage?

Go to @data-engineering-storage-remote-access for fsspec, pyarrow.fs, and obstore.

Building Data Lakes?

Explore @data-engineering-storage-lakehouse for ACID table formats.

Choosing a Data Catalog?

Check @data-engineering-catalogs for Iceberg catalogs, DuckDB multi-source patterns, and tool comparisons.

Production-Grade Pipelines?

Read @data-engineering-best-practices for medallion architecture, partitioning, schema evolution, and testing strategies.

Orchestrating Pipelines?

Check @data-engineering-orchestration for Prefect, Dagster, and dbt.

Production Monitoring?

See @data-engineering-observability for tracing and metrics.

AI/ML Data Pipelines?

Visit @data-engineering-ai-ml for embeddings, vector databases, and RAG.

Principles

  1. Lazy evaluation: Use Polars lazy frames and DuckDB query planning for performance
  2. Zero-copy data transfer: Leverage Arrow format for memory efficiency
  3. Pushdown optimization: Filter at storage layer to minimize data transfer
  4. Type safety: Use explicit schemas and type hints
  5. Resilience: Implement retries, circuit breakers, and proper error handling
  6. Observability: Instrument pipelines with traces and metrics
  7. Security: Never hardcode credentials; use IAM roles and environment variables
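
Principles 5 and 7 can be sketched with the standard library alone. The block below is a minimal illustration, not the suite's implementation: the helper names `retry` and `storage_token` and the environment variable `STORAGE_TOKEN` are hypothetical, and circuit breaking is left out for brevity.

```python
import os
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying transient failures with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def storage_token(env_var: str = "STORAGE_TOKEN") -> str:
    """Read a credential from the environment (hypothetical variable name)."""
    token = os.environ.get(env_var)
    if not token:
        # Fail loudly rather than fall back to a hardcoded secret.
        raise RuntimeError(f"{env_var} is not set")
    return token
```

Only exceptions listed in `retry_on` are retried, so programming errors such as `KeyError` still surface immediately. The `sleep` parameter exists so tests can replace the real delay.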

Migration from Legacy Skills

This restructured suite replaces the previous split organization (data-engineering-* and remote-filesystems-*). All content has been consolidated to eliminate duplication and clarify ownership.

Legacy skill replacements:

  • data-engineering-core → @data-engineering-core (plus specific integrations)
  • data-engineering-lakehouse → @data-engineering-storage-lakehouse
  • data-engineering-orchestration → @data-engineering-orchestration
  • data-engineering-streaming → @data-engineering-streaming
  • data-engineering-quality → @data-engineering-quality
  • data-engineering-observability → @data-engineering-observability
  • data-engineering-llm-pipelines → @data-engineering-ai-ml
  • remote-filesystems-* → @data-engineering-storage-remote-access and integrations

All legacy skills remain functional but are deprecated. New content should be added to the new structure only.
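
For tooling that still refers to legacy names, the replacement list can be expressed as a small lookup. This is only a sketch: the `resolve_skill` helper is hypothetical, it maps each legacy name to its primary replacement and ignores the "plus integrations" notes, and `remote-filesystems-s3` in the usage note is a made-up name standing in for any `remote-filesystems-*` skill.

```python
from fnmatch import fnmatchcase

# Legacy skill name (or wildcard pattern) -> primary replacement.
LEGACY_TO_NEW = {
    "data-engineering-core": "@data-engineering-core",
    "data-engineering-lakehouse": "@data-engineering-storage-lakehouse",
    "data-engineering-orchestration": "@data-engineering-orchestration",
    "data-engineering-streaming": "@data-engineering-streaming",
    "data-engineering-quality": "@data-engineering-quality",
    "data-engineering-observability": "@data-engineering-observability",
    "data-engineering-llm-pipelines": "@data-engineering-ai-ml",
    "remote-filesystems-*": "@data-engineering-storage-remote-access",
}


def resolve_skill(name: str) -> str:
    """Return the replacement for a legacy skill name, or the name unchanged."""
    for pattern, replacement in LEGACY_TO_NEW.items():
        # fnmatchcase handles the trailing-* wildcard, case-sensitively.
        if fnmatchcase(name, pattern):
            return replacement
    return name
```

Under these assumptions, `resolve_skill("remote-filesystems-s3")` resolves to `@data-engineering-storage-remote-access`, while a name with no legacy entry is returned as given.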