AgentSkillsCN

senior-data-engineer

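World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and the modern data stack. Covers data modeling, pipeline orchestration, data quality assurance, and DataOps practices. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.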

SKILL.md
--- frontmatter
name: senior-data-engineer
description: World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and the modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.

Senior Data Engineer

Expert guidance for data infrastructure, ETL/ELT pipelines, data modeling, and DataOps best practices.

When to Use This Skill

Invoke this skill when you need help with:

  • Data Pipeline Design: Building ETL/ELT pipelines, orchestration, scheduling, error handling
  • Data Architecture: Designing data warehouses, data lakes, lakehouse architecture
  • Data Modeling: Dimensional modeling, normalization, denormalization strategies
  • Pipeline Orchestration: Airflow DAGs, workflow management, dependency handling
  • Data Quality: Validation, testing, monitoring data quality metrics
  • Performance Optimization: Query tuning, partitioning, indexing, caching strategies
  • DataOps: CI/CD for data pipelines, testing, monitoring, incident response
  • Stream Processing: Real-time data processing with Kafka, Flink, Spark Streaming

Core Competencies

Data Pipeline Development

  • ETL vs. ELT architectures
  • Incremental vs. full refresh strategies
  • Change data capture (CDC)
  • Error handling and retry logic
  • Backfilling historical data
  • Pipeline monitoring and alerting
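
As one illustration of the error handling and retry item above, here is a minimal sketch of retrying a pipeline step with exponential backoff. The function name `run_with_retry` and its parameters are hypothetical, chosen for this example; orchestrators such as Airflow ship their own retry settings, which this does not reproduce.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on failure with exponentially growing delays.

    `sleep` is injectable so the backoff can be observed without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            # Assumption: every exception is treated as transient. A real
            # pipeline would usually retry only specific error types.
            if attempt == max_attempts:
                raise
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to unit test; in production code a small random jitter is often added to the delay so that many failing tasks do not retry at the same instant.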

Data Modeling & Warehousing

  • Star schema and snowflake schema
  • Slowly changing dimensions (SCD)
  • Fact and dimension tables
  • Data vault methodology
  • Kimball vs. Inmon approaches
  • Data lake vs. data warehouse
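
To make the slowly changing dimensions item above concrete, the following is a small in-memory sketch of a Type 2 update, where a changed attribute closes the current row and appends a new version. The `CustomerRow` class, its fields, and `apply_scd2` are hypothetical names for illustration; in a warehouse this logic would more typically be a SQL `MERGE` or a dbt snapshot.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(
    history: List[CustomerRow], customer_id: int, new_city: str, as_of: date
) -> List[CustomerRow]:
    """Return a new history with an SCD Type 2 change applied."""
    result = []
    needs_new_version = True
    for row in history:
        is_current = row.customer_id == customer_id and row.valid_to is None
        if is_current and row.city == new_city:
            # Nothing changed: keep the current row, add no new version.
            needs_new_version = False
            result.append(row)
        elif is_current:
            # Attribute changed: close the old version at `as_of`.
            result.append(replace(row, valid_to=as_of))
        else:
            result.append(row)
    if needs_new_version:
        result.append(CustomerRow(customer_id, new_city, as_of))
    return result
```

The sketch assumes `valid_to` is an exclusive end date, so the closed row and its successor share the same boundary value without overlapping.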

Orchestration & Workflow

  • Apache Airflow DAG design
  • Task dependencies and scheduling
  • Dynamic DAG generation
  • Sensor and trigger patterns
  • Workflow testing strategies
  • Backfill and reprocessing
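
The idea behind task dependencies can be sketched without an orchestrator: a DAG is a mapping from each task to the tasks it waits on, and a valid run order is any topological ordering of that graph. The example below uses the standard library's `graphlib` (Python 3.9+) as a stand-in; it is not Airflow's API, and the task names are made up for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies: Dict[str, Set[str]] = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
    "run_quality_checks": {"load_warehouse"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which the tasks can run.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    which is the same condition that makes a DAG definition invalid.
    """
    return list(TopologicalSorter(deps).static_order())
```

Calling `execution_order(dependencies)` yields the two extract tasks first (in either order), followed by the transform, load, and quality-check steps.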

Data Quality & Testing

  • Data validation rules
  • Schema enforcement
  • Data profiling
  • Anomaly detection
  • Data lineage tracking
  • Testing frameworks (Great Expectations)

Performance Optimization

  • Partitioning strategies
  • Columnar storage formats (Parquet, ORC)
  • Compression techniques
  • Query optimization
  • Materialized views
  • Incremental processing
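
Incremental processing is commonly implemented with a high-water mark: each run picks up only rows changed since the last recorded timestamp, then advances that timestamp. The sketch below shows the idea over in-memory rows; the `updated_at` field and the function name are assumptions for illustration.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def incremental_batch(
    rows: List[Row], watermark: datetime
) -> Tuple[List[Row], datetime]:
    """Select rows newer than `watermark` and return the advanced watermark.

    Assumes every row carries an `updated_at` datetime.
    """
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing is new, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark
```

A strict `>` comparison can skip rows that arrive late with a timestamp equal to the stored watermark; some pipelines reprocess a small overlap window and deduplicate downstream to guard against that.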

DataOps & Observability

  • CI/CD for data pipelines
  • Data pipeline testing
  • Monitoring and alerting
  • SLA tracking
  • Incident response
  • Cost optimization
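
SLA tracking can start as simply as comparing each run's duration against an agreed limit. The following sketch assumes run metadata is available as start and end timestamps keyed by pipeline name; the function and the data shape are hypothetical, and a monitoring stack would normally emit these as metrics and alerts instead.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Run = Tuple[datetime, datetime]  # (started_at, finished_at)


def sla_breaches(runs: Dict[str, Run], sla: timedelta) -> List[str]:
    """Return the names of pipelines whose run took longer than `sla`."""
    return [
        name
        for name, (started_at, finished_at) in runs.items()
        if finished_at - started_at > sla
    ]
```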

Tech Stack

Languages: Python, SQL, Scala

Orchestration: Apache Airflow, Prefect, Dagster

Processing: Apache Spark, dbt, Pandas

Streaming: Apache Kafka, Flink, Spark Streaming

Warehouses: Snowflake, BigQuery, Redshift, Databricks

Storage: S3, GCS, Azure Blob Storage

Formats: Parquet, Avro, ORC, Delta Lake

Quality: Great Expectations, dbt tests, soda-core

Monitoring: Datadog, Prometheus, Grafana

Approach

This skill follows the user's stated preferences:

  1. Analysis first: Profile data and understand requirements before building pipelines
  2. Present options: Show multiple approaches (batch vs. streaming, push vs. pull, etc.)
  3. Strategic guidance: Focus on architecture and design patterns, not just code
  4. Data quality: Emphasize testing, validation, and monitoring from the start
  5. Cost awareness: Consider compute costs, storage costs, and optimization opportunities
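
For the "analysis first" step, a quick profile of a column often answers the first questions about a new source: how many rows, how many nulls, how many distinct values. Below is a minimal sketch over a list of dictionaries; the function name and the returned keys are assumptions for illustration, and on real datasets the same figures would usually come from SQL or a dataframe library.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(rows: List[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Compute basic profile statistics for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        # Share of rows where the column is missing or None.
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
        "most_common": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
```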