AgentSkillsCN

pipeline

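Data pipeline design and optimization: ETL patterns, task orchestration, performance tuning. Use when planning a new data pipeline or when reviewing and optimizing an existing data workflow. Typical triggers: "design a data pipeline", "review this ETL", "optimize data processing", "how should I orchestrate this", "pipeline architecture".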

SKILL.md
--- frontmatter
name: pipeline
description: |
  Data pipeline design and review: ETL patterns, orchestration, performance optimization.
  Use when designing new pipelines or reviewing existing data workflows.

  Use when:
  - "design a data pipeline"
  - "review this ETL"
  - "optimize data processing"
  - "how should I orchestrate this"
  - "pipeline architecture"

Pipeline Engineering Skill

Design, review, and optimize data pipelines and ETL workflows.

Quick Start

Design a Pipeline

bash
/wicked-data:pipeline design \
  --source "postgres://sales_db" \
  --target "s3://data-lake/sales" \
  --frequency daily

Generates: Architecture diagram, ETL logic, orchestration config, monitoring plan.

Review Existing Pipeline

bash
/wicked-data:pipeline review path/to/pipeline/

Analyzes: Code quality, error handling, performance, maintainability.

Pipeline Patterns

Batch ETL

  • Use when: Regular scheduled loads, historical processing
  • Pattern: Extract → Transform → Validate → Load
  • Tools: Airflow, Dagster, Prefect
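The four stages of the batch pattern can be sketched as plain functions. This is a minimal illustration, not output of the skill: the `Order` type, the hard-coded rows, and the in-memory `target` dict are all assumptions standing in for a real source and warehouse.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    amount_cents: int


def extract() -> list[dict]:
    # Stand-in for reading from a source system; rows are hard-coded here.
    return [
        {"order_id": "1", "amount": "19.99"},
        {"order_id": "2", "amount": "5.00"},
    ]


def transform(rows: list[dict]) -> list[Order]:
    # Cast strings to typed fields and normalize currency to integer cents.
    return [
        Order(int(r["order_id"]), round(float(r["amount"]) * 100))
        for r in rows
    ]


def validate(orders: list[Order]) -> list[Order]:
    # Fail the whole batch before loading if any rule is broken.
    ids = [o.order_id for o in orders]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate order_id in batch")
    if any(o.amount_cents < 0 for o in orders):
        raise ValueError("negative amount in batch")
    return orders


def load(orders: list[Order], target: dict[int, Order]) -> int:
    # Keyed writes make a rerun overwrite rows rather than duplicate them.
    for o in orders:
        target[o.order_id] = o
    return len(orders)


def run_batch(target: dict[int, Order]) -> int:
    return load(validate(transform(extract())), target)
```

In an orchestrator such as Airflow, Dagster, or Prefect, each function would typically become its own task so that a failed stage can be retried without repeating the earlier ones.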

Streaming Pipeline

  • Use when: Real-time processing, event-driven
  • Pattern: Consume → Transform → Sink
  • Tools: Kafka, Flink, Spark Streaming
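The Consume → Transform → Sink shape can be illustrated with Python generators, which process one event at a time without holding the stream in memory. This is only a sketch: the event fields (`type`, `amount`) are hypothetical, and a plain iterable and list stand in for a real broker and sink.

```python
from collections.abc import Iterable, Iterator


def consume(events: Iterable[dict]) -> Iterator[dict]:
    # Stand-in for a broker consumer loop; yields one event at a time.
    for event in events:
        yield event


def transform(events: Iterable[dict]) -> Iterator[dict]:
    # Per-event work only: drop non-purchase events, then enrich the rest.
    for event in events:
        if event.get("type") != "purchase":
            continue
        yield {**event, "amount_cents": round(event["amount"] * 100)}


def sink(events: Iterable[dict], out: list[dict]) -> int:
    # Stand-in for writing to a topic, table, or object store.
    written = 0
    for event in events:
        out.append(event)
        written += 1
    return written
```

Chaining the stages as `sink(transform(consume(source)), out)` pulls events through lazily, which mirrors how a streaming job handles an unbounded input.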

Incremental Processing

  • Use when: Large datasets, only processing changes
  • Pattern: Watermark tracking + Merge/Upsert

Pipeline Design Checklist

Architecture

  • Source systems identified and accessible
  • Data volume estimated (GB/day)
  • Latency requirements clear
  • Target schema defined
  • Orchestration tool selected

Data Quality

  • Schema validation at source
  • Null handling strategy
  • Duplicate detection
  • Business rule validation
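The four data quality checks above can be applied row by row, splitting a batch into valid rows and rejected rows with a reason. The sketch below assumes a hypothetical two-field schema (`id`, `email`) and a placeholder business rule; real rules would come from the pipeline's own requirements.

```python
REQUIRED = {"id": int, "email": str}  # hypothetical schema for illustration


def check_rows(rows: list[dict]) -> tuple[list[dict], list[tuple[dict, str]]]:
    valid, rejected, seen = [], [], set()
    for row in rows:
        reason = None
        # Schema and null checks: every required field present and typed.
        for field, expected in REQUIRED.items():
            value = row.get(field)
            if value is None:
                reason = f"null or missing {field}"
            elif not isinstance(value, expected):
                reason = f"{field} is not {expected.__name__}"
            if reason:
                break
        # Duplicate detection on the key column.
        if reason is None and row["id"] in seen:
            reason = "duplicate id"
        # Placeholder business rule: an email must contain "@".
        if reason is None and "@" not in row["email"]:
            reason = "email fails business rule"
        if reason:
            rejected.append((row, reason))
        else:
            seen.add(row["id"])
            valid.append(row)
    return valid, rejected
```

Returning the rejected rows with their reasons, rather than raising on the first bad row, lets the caller decide whether to fail the batch or continue with the valid subset.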

Error Handling

  • Transient errors: retry with backoff
  • Fatal errors: alert and stop
  • Invalid records: log separately
  • Rollback/recovery strategy
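The split between transient and fatal errors can be expressed with two exception types and a retry wrapper. The class names, attempt count, and delays below are illustrative assumptions; the `sleep` parameter exists only so the backoff can be observed or stubbed in tests.

```python
import time


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or dropped connection."""


class FatalError(Exception):
    """Not worth retrying, e.g. bad credentials or a missing table."""


def run_with_retry(task, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface to the caller
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
        # FatalError is deliberately not caught, so it propagates at once
        # and the caller can alert and stop.
```

Production retry logic usually adds random jitter to the delay so that many workers do not retry in lockstep.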

Performance

  • Parallelization strategy
  • Batch size optimized
  • Bottlenecks identified
  • Scaling plan documented
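Batch size and parallelism are often tuned together. The sketch below splits rows into fixed-size batches and processes them on a thread pool; `process_batch` is a placeholder for real work, and the default sizes are arbitrary starting points rather than recommendations.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(rows, size):
    # Yield successive lists of at most `size` rows.
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def process_batch(batch):
    # Placeholder for per-batch work such as an API call or a bulk insert.
    return [r * 2 for r in batch]


def run_parallel(rows, batch_size=1000, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so output stays aligned.
        results = pool.map(process_batch, batched(rows, batch_size))
        return [row for batch in results for row in batch]
```

Threads suit I/O-bound batches; for CPU-bound transforms in Python, `ProcessPoolExecutor` from the same module is the usual alternative.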

Monitoring

  • Row counts logged
  • Processing duration tracked
  • Error rates monitored
  • Data freshness SLA defined
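Row counts, error counts, and duration can be captured with a small context manager wrapped around each step. This is a sketch using only the standard library; the `metrics` dict is a stand-in for whatever metrics backend the pipeline actually reports to.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


@contextmanager
def track_step(name: str, metrics: dict):
    stats = {"rows": 0, "errors": 0}
    start = time.monotonic()
    try:
        yield stats  # the step body updates the counters
    finally:
        # Recorded even when the step raises, so failures are still timed.
        stats["seconds"] = time.monotonic() - start
        metrics[name] = stats
        logger.info(
            "step=%s rows=%d errors=%d seconds=%.3f",
            name, stats["rows"], stats["errors"], stats["seconds"],
        )
```

A step uses it as `with track_step("load", metrics) as stats:` and increments `stats["rows"]` as it writes.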

Operations

  • Backfill procedure documented
  • Replay capability implemented
  • Config externalized
  • Secrets managed securely
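Externalized config can be as simple as reading environment variables into a typed object at startup. The variable names below (`PIPELINE_SOURCE_URL` and so on) are hypothetical; secrets would normally be injected into the environment by a secrets manager rather than stored in code or in the repository.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    source_url: str
    target_url: str
    batch_size: int


def load_config(env=os.environ) -> PipelineConfig:
    # Required settings raise KeyError if absent, so a misconfigured
    # deployment fails at startup instead of midway through a run.
    return PipelineConfig(
        source_url=env["PIPELINE_SOURCE_URL"],
        target_url=env["PIPELINE_TARGET_URL"],
        batch_size=int(env.get("PIPELINE_BATCH_SIZE", "1000")),
    )
```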

Common Issues

| Issue | Symptoms | Solution |
|---|---|---|
| Fails halfway | Partial data, inconsistent state | Staging + commit pattern |
| Duplicates | Same data loaded multiple times | Watermarks + idempotency |
| Slow processing | Misses SLA | Profile and optimize bottlenecks |
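For file targets, one common form of the staging + commit pattern is to write to a temporary file and rename it into place only once the write has fully succeeded. The sketch below assumes a JSON Lines output and a local filesystem; object stores and databases need their own equivalents, such as a staging table swapped in within a transaction.

```python
import json
import os
import tempfile


def write_atomically(rows: list[dict], target_path: str) -> None:
    # Stage in the same directory so the final rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, staging_path = tempfile.mkstemp(dir=directory, suffix=".staging")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
        # Commit step: readers see either the old file or the new one.
        os.replace(staging_path, target_path)
    except BaseException:
        # Any failure before the commit discards the partial staging file.
        if os.path.exists(staging_path):
            os.remove(staging_path)
        raise
```

If the write fails halfway, the previous target file is left untouched, which avoids the partial data and inconsistent state described in the table.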

Integration

  • wicked-search: Find pipeline code with /wicked-search:code "dag|pipeline"
  • wicked-kanban: Track pipeline issues as tasks
  • wicked-mem: Recall pipeline patterns

Best Practices

  • Idempotency: The same input always produces the same output, so pipelines can be rerun safely
  • Observability: Log row counts, track duration, emit metrics, alert on anomalies
  • Testing: Unit test transforms, integration test full pipeline, test error scenarios
  • Documentation: Clear lineage, versioned schemas, operations runbook

External Integration Discovery

Pipeline engineering can leverage available integrations by capability:

| Capability | Discovery Patterns | Provides |
|---|---|---|
| Warehouses | snowflake, databricks, bigquery | Query execution, schema access |
| ETL | airbyte, fivetran, dbt | Pipeline status, model metadata |
| Observability | monte-carlo, datadog | Data quality metrics |

Run ListMcpResourcesTool to discover available integrations. Fall back to wicked-data:numbers for local file analysis.

Reference

For detailed patterns: