Data Engineering

This project aims to build idempotent, deterministic data pipelines. Its core principles follow.

SKILL.md
--- frontmatter
name: Data Engineering
description: Principles for building idempotent, deterministic data pipelines in this project.

Data Engineering Principles

1. Explicit Determinism (First Principle)

  • Rule: Never scan for the "latest" file without context. Always name the partition you intend to process.
  • Guidance: Scripts must accept year and week parameters to target specific data partitions.
    • BAD: glob("data/raw/*") -> "Oh, I found 2024!"
    • GOOD: ingest(year=2024, week=1) -> "I am looking for data/raw/2024/week_1_*."
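
The GOOD case above might be sketched as follows. This is a minimal illustration, not the project's actual ingestion code: the helper `find_partition_batches`, the `raw_root` parameter, and the directory layout `data/raw/<year>/week_<week>_<ts>` (inferred from the example path) are assumptions, as is the requirement that timestamps sort lexicographically.

```python
from pathlib import Path


def find_partition_batches(raw_root: Path, year: int, week: int) -> list[Path]:
    """Return the raw batches for exactly one (year, week) partition, oldest first."""
    # Assumed layout: <raw_root>/<year>/week_<week>_<ts>
    pattern = f"week_{week}_*"
    # Assumes <ts> is fixed-width so lexicographic order equals time order.
    batches = sorted((raw_root / str(year)).glob(pattern))
    if not batches:
        # Fail loudly instead of silently falling back to another partition.
        raise FileNotFoundError(f"no raw data matching {raw_root / str(year) / pattern}")
    return batches


def ingest(year: int, week: int, raw_root: Path = Path("data/raw")) -> Path:
    """Resolve the batch to ingest for the requested partition only."""
    # Taking the newest batch is safe here because the search is already
    # constrained to one explicit (year, week).
    return find_partition_batches(raw_root, year, week)[-1]
```

Calling `ingest(year=2024, week=1)` only ever looks inside `data/raw/2024/` for `week_1_*`, so the result does not depend on what other years or weeks happen to be on disk.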

2. Idempotency via Partition Overwrite

  • Rule: Pipeline tasks must be re-runnable: running a task twice for the same partition must not duplicate rows or leave other side effects.
  • Pattern: Upsert / Replace Partition.
  • Implementation:
    python
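    # `target_table` is a placeholder name; `df` must hold only the rows for (year, week).
    # 1. Clear the specific partition (filter on week too, or the whole year is wiped)
    con.execute("DELETE FROM target_table WHERE year = ? AND week = ?", [year, week])
    # 2. Insert the new state for that partition
    con.execute("INSERT INTO target_table SELECT * FROM df")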
    
  • Constraint: Do NOT drop and rebuild the entire table unless it is a full-history refresh (rare).
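
To make the idempotency claim concrete, here is a self-contained sketch of the replace-partition pattern. It uses the standard-library `sqlite3` module as a stand-in for whatever database the project's `con` refers to; the `weekly_prices` table, its columns, and the `replace_partition` helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = "CREATE TABLE weekly_prices (year INTEGER, week INTEGER, item TEXT, price REAL)"


def replace_partition(con: sqlite3.Connection, year: int, week: int,
                      rows: list[tuple[str, float]]) -> None:
    """Replace every row of one (year, week) partition in a single transaction."""
    # `with con` commits if both statements succeed and rolls back on error,
    # so readers never see the partition half-deleted.
    with con:
        con.execute(
            "DELETE FROM weekly_prices WHERE year = ? AND week = ?",
            (year, week),
        )
        con.executemany(
            "INSERT INTO weekly_prices (year, week, item, price) VALUES (?, ?, ?, ?)",
            [(year, week, item, price) for item, price in rows],
        )
```

Running `replace_partition` twice with the same arguments leaves the table in the same state as running it once, and rows belonging to other partitions are never touched.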

3. The Raw Layer is Immutable

  • Rule: Once data is written to data/raw/{year}/{week}_{ts}/, it is never modified.
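  • Correction: If a batch is bad, do not patch it in place. Scrape a NEW batch with a newer timestamp (e.g., week_{ts+1}) and re-run ingestion targeting that timestamp. Because ingestion overwrites the partition (Principle 2), the re-run replaces the bad rows downstream.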
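
One way to enforce this rule in code is to make the writer refuse to touch an existing batch directory. The sketch below is an assumption-laden illustration: `batch_dir`, `write_raw_batch`, the `files` mapping, and the layout `<raw_root>/<year>/week_<week>_<ts>` are hypothetical, not taken from the project.

```python
from pathlib import Path


def batch_dir(raw_root: Path, year: int, week: int, ts: str) -> Path:
    """Build the directory for one raw batch (assumed layout, see lead-in)."""
    return raw_root / str(year) / f"week_{week}_{ts}"


def write_raw_batch(raw_root: Path, year: int, week: int, ts: str,
                    files: dict[str, bytes]) -> Path:
    """Write a new raw batch; never modify a batch that already exists."""
    target = batch_dir(raw_root, year, week, ts)
    # exist_ok=False raises FileExistsError if this batch was already written,
    # which turns an accidental overwrite into a hard failure.
    target.mkdir(parents=True, exist_ok=False)
    for name, payload in files.items():
        (target / name).write_bytes(payload)
    return target
```

A correction then becomes a second call with a later `ts`; the bad batch stays on disk untouched, which keeps earlier runs reproducible and auditable.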