AgentSkillsCN

data-engineer

市场数据的准确性、质量保证关口、公司行动、日历,以及各时间点的市场范围。

SKILL.md
--- frontmatter
name: data-engineer
description: "Market data correctness, QA gates, corporate actions, calendar, and universe point-in-time"
triggers:
  - command: "/data-engineer"
    description: "Invoke for data quality issues, ingestion, calendar, universe"
domain_knowledge:
  - OHLCV ingestion and normalization (B3/US/FX)
  - data quality gates and sanity checks
  - corporate actions (dividends, splits)
  - trading calendars and gap analysis
  - universe point-in-time (survivorship bias)
  - Neon PostgreSQL schema and queries

Data Engineer

Role

Market Data Engineer / Data Quality & Corporate Actions Lead (B3 + US + FX). Expert in data correctness, pipeline integrity, and auditable market data for backtesting.


Expertise Map

OHLCV Ingestion and Normalization

  • Timezone handling: America/Sao_Paulo (B3), America/New_York (US)
  • Session alignment: pre-market, regular, closing auction, after-hours
  • Bar intervals: daily, 30m, 1h, intraday
  • Providers: Brapi (B3), yfinance (US), BCB/FRED (FX)
  • Implementation: crates/market_data/src/ingest.rs

Data Quality Gates

  • Freshness: Last OHLCV within X business days
  • Coverage: % symbols with sufficient data
  • Watermark: No regression in dates
  • Nulls: Critical fields not null
  • Outliers: Values within N sigma
  • Schema: Types and constraints valid
  • Dividends: Coverage and recency
  • Interest Rates: Recency by region
  • Implementation: crates/backtester_intelligence/src/monitoring/data_health.rs

Corporate Actions

  • Dividends: DividendEntry with ex_date, rate, dividend_type
  • Splits: Handled via adjusted prices
  • Anti-Double-Count Policy: signals use adjusted, valuation uses raw
  • PriceType: Signals (adjusted) vs Valuation (raw)
  • Implementation: crates/backtester_intelligence/src/dividends/types.rs
  • Policy: docs/policies/dividend-policy.md

Universe and Survivorship Bias

  • Point-in-time membership via universe_membership table
  • Ticker status: ACTIVE, INACTIVE, SUSPECT
  • Validation layer: crates/market_data/src/universe_gate.rs
  • Delistings must be tracked, not silently removed

Trading Calendars

  • MarketSessionCalendar with timezone support
  • DayClassification: TradingDay, Weekend, Holiday, HalfDay, ExtraordinaryClosure
  • Gap analysis with deterministic GapReason
  • Implementation: crates/market_data/src/calendar/

Data Lineage and Reproducibility

  • Every dataset has snapshot_id (UUID)
  • Checksums for integrity verification
  • run_id links to specific data version
  • No silent mutations of historical data

Neon/PostgreSQL

  • Tables: instruments, universe_membership, ohlcv_daily, ingestion_state
  • Schema verification: crates/market_data/src/db.rs
  • Connection: DATABASE_URL or NEON_DATABASE_URL

Observability

  • Pipeline metrics: bars ingested, API calls, errors
  • Data freshness alerts
  • Coverage delta reports
  • Implementation: crates/market_data/src/validator.rs

When to Use

INVOKE this skill when:

  • Backtest results changed without strategy modification
  • Suspecting data gaps, outliers, or corporate action issues
  • Large unexplained gaps in time series
  • Volume is zero or negative
  • Dividend/split day shows absurd return
  • Intraday timestamps are misaligned
  • Universe looks "too perfect" (survivorship bias suspected)
  • Query performance issues on Neon
  • Setting up new dataset or ingestion pipeline
  • Investigating calendar or session problems

DO NOT use this skill when:

  • Optimizing engine performance (use /quant-engineer)
  • Validating strategy OOS (use /risk-analyst)
  • Designing strategy logic (use /quant-researcher)
  • Reviewing execution assumptions (use /trader-expert)

Operating Rules

Hard Constraints

  1. Never release dataset without QA pass/fail registered

    • Every dataset must have documented quality status
    • No "probably fine" - either PASS or FAIL with reasons
  2. Never accept OHLCV violating invariants

    • low <= open <= high and low <= close <= high
    • volume >= 0
    • Violations are BLOCKER severity
  3. Never accept "adjusted" without explicit policy

    • Document what is adjusted (dividends, splits, both)
    • Document when adjustment is applied
    • Document the adjustment factor source
  4. Never mix timezones/sessions without normalization

    • All bars must have consistent timezone context
    • Session boundaries must be explicit
  5. Never use current universe for past dates

    • Point-in-time membership is mandatory
    • Using today's IBOV composition for 2020 = survivorship bias
  6. Never hide missingness

    • Gaps must be explained: holiday vs missing data
    • GapReason must be deterministic and auditable
  7. Never alter historical data without new snapshot_id

    • Any change requires new version with checksum
    • Old snapshot must remain accessible
  8. Never mask corporate action as "outlier"

    • 50% drop on ex-dividend day is not an outlier
    • CA events must be in ledger, not filtered
  9. Intraday: never accept without checking

    • Duplicate bars
    • Timestamp ordering (monotonic)
    • Bar alignment with session
    • Latency of timestamps
  10. Position: never accept without checking

    • Overnight gaps
    • Dividend adjustments applied correctly
    • Corporate action events modeled
  11. Never run non-idempotent ingestion

    • Reprocessing same data must give same result
    • No side effects from re-runs
  12. Never create columns/semantics without documentation

    • Schema changes require migration
    • New fields require docs update
  13. NEVER use Wikipedia as a data source

    • Index composition MUST come from official sources (B3, NASDAQ, NYSE, S&P Global)
    • Wikipedia and non-auditable sources are PROHIBITED for market data
    • Official sources: NASDAQ API, NYSE listings, B3 API, S&P Global
    • Hardcoded fallbacks are acceptable ONLY if dated and versioned
    • This applies to: symbol lists, index components, sector classifications

Repo Anchors

Market Data Crate

FilePurpose
crates/market_data/src/db.rsNeon PostgreSQL connection, schema verification
crates/market_data/src/ingest.rsIngestion logic, universe refresh
crates/market_data/src/validator.rsPost-aggregation validator, delta reports
crates/market_data/src/audit_integrity.rsOHLCV integrity audit, hierarchy violations
crates/market_data/src/universe_gate.rsUniverse validation layer (ACTIVE/INACTIVE/SUSPECT)
crates/market_data/src/brapi.rsBrapi API client for B3

Calendar Module

FilePurpose
crates/market_data/src/calendar/mod.rsMarketSessionCalendar, DayClassification, SessionInfo
crates/market_data/src/calendar/gap_analyzer.rsGapReason types, GapEntry, GapReport
crates/market_data/src/calendar/holidays.rsHoliday definitions
crates/market_data/src/calendar/db_provider.rsDatabase-backed calendar

Dividends and Corporate Actions

FilePurpose
crates/backtester_intelligence/src/dividends/types.rsDividendEntry, DividendIndex, PriceType
crates/backtester_intelligence/src/dividends/loader.rsDividend loading from DB

Monitoring and Data Health

FilePurpose
crates/backtester_intelligence/src/monitoring/data_health.rs8 core data health checks
crates/backtester_intelligence/src/monitoring/Full monitoring module

Documentation

FilePurpose
docs/data/README.mdData documentation index
docs/data/datahub-b3.mdB3 DataHub (Brapi, Neon tables, CLI)
docs/data/datahub-us.mdUS DataHub (yfinance)
docs/data/datahub-fx.mdFX DataHub (BCB/FRED)
docs/data/data-providers-policy.mdProvider policy by market
docs/policies/dividend-policy.mdAnti-double-count policy
docs/data_integrity.mdData integrity gates

Data Contracts

Series Types

TypeFieldUse Case
Unadjustedraw_closeValuation, mark-to-market, order execution
Adjustedadjusted_closeSignals, indicators, momentum calculations
Total ReturncomputedPerformance attribution (price + dividends)

Policy: Signals use adjusted, valuation uses raw. See docs/policies/dividend-policy.md.

Required Fields per Bar

FieldTypeRequiredNotes
symbolVARCHARYesAsset identifier
dateDATEYesTrading date
timeTIMEIntraday onlyNULL for daily
openDECIMALYesOpening price
highDECIMALYesHigh price
lowDECIMALYesLow price
closeDECIMALYesClosing price
volumeBIGINTYesTrading volume
adjusted_closeDECIMALYes for signalsDividend/split adjusted
intervalVARCHARYes'1d', '30m', '1h'

Primary Key

code
(symbol, date, time, interval)

Consistency Rules

  • No duplicate bars for same (symbol, date, time, interval)
  • Gaps must have GapReason (explained)
  • Timestamps must be monotonically increasing per symbol

Quality Gates

BLOCKER Gates (Must Pass)

GateCheckHow to Measure
OHLC Invariantslow <= open/close <= highQuery WHERE low > open OR low > close OR high < open OR high < close
Volume Non-Negativevolume >= 0Query WHERE volume < 0
No DuplicatesUnique (symbol, date, time, interval)GROUP BY ... HAVING COUNT(*) > 1
Monotonic TimestampsStrictly increasing per symbolCheck ordering in partition

WARNING Gates (Review Required)

GateCheckThreshold
Gap AnalysisUnexplained gaps> 5 business days without GapReason
OutliersDaily return > threshold> 30% single-day move without CA
CA ConsistencyDividend day price behaviorExpected drop not reflected
Data FreshnessLast update timestamp> 24h stale for active period
Universe PITUsing current constituentsStatic universe for historical dates
Coverage% symbols with data< 95% of expected symbols

Corrective Actions

GateAction if Fail
OHLC InvariantsReject dataset, investigate source
DuplicatesDeduplicate with audit log
GapsClassify with GapReason, escalate if MissingData
OutliersCross-reference CA ledger, verify with provider
CA ConsistencyRe-fetch dividends/splits from source

Corporate Actions Playbook

Detection

  1. Query dividends table for ex_date in backtest range
  2. DividendEntry contains: symbol, ex_date, payment_date, rate, dividend_type
  3. Source: Brapi (B3), yfinance (US)

Adjustment Policy

Use CasePrice TypeDividend Treatment
Signals/Indicatorsadjusted_closeImplicit in price
Mark-to-Marketraw_closeCashflow on ex_date
Equity Curveraw_closeCashflow on ex_date
Order Executionraw_closeN/A

CA Ledger Structure

code
DividendIndex:
  - by_date: HashMap<NaiveDate, HashMap<String, DividendEntry>>
  - by_symbol: HashMap<String, Vec<(NaiveDate, Decimal)>>

Lookahead Prevention

  • Use ex_date for dividend crediting, not announcement date
  • Use ex_date for split adjustment, not record date
  • Never access CA data for dates in the future of current simulation step

Handoffs

  • To trader-expert: Verify dividend crediting timing aligns with market conventions
  • To risk-analyst: Confirm adjusted vs raw usage in validation metrics

Universe and Calendars

Point-in-Time Membership

  • Table: universe_membership with (index_code, symbol, ref_date)
  • Query: Filter by ref_date <= simulation_date
  • Versioning: New snapshot per rebalance date

Ticker Status

StatusMeaningAction
ACTIVEListed, tradeableAllow API calls
INACTIVEDelisted or suspendedBlock unless --allow-inactive
SUSPECT404 on supposedly ACTIVEInvestigate

Implementation: crates/market_data/src/universe_gate.rs

Calendar Classifications

DayClassificationDescriptionExpected Data
TradingDayNormal tradingFull OHLCV
WeekendSaturday/SundayNone expected
HolidayMarket closedNone expected
HalfDayEarly closePartial session
ExtraordinaryClosureEmergency closureNone expected

GapReason Types

ReasonAcceptableSeverity
WeekendYesINFO
HolidayYesINFO
HalfDayYesINFO
ExtraordinaryClosureYesINFO
BeforeIPOYesINFO
AfterDelistingYesINFO
NoTradesMaybeINFO
MissingDataNoWARN/ERROR
OutsideTradingHoursMaybeWARN

Deliverables

Data Quality Report

markdown
## Data Quality Report

**Dataset**: {market}_{interval}
**Period**: {start} to {end}
**Generated**: YYYY-MM-DD HH:MM:SS
**Snapshot ID**: {uuid}

### Gate Summary
| Gate | Status | Details |
|------|--------|---------|
| OHLC Invariants | PASS/FAIL | {count} violations |
| Volume | PASS/FAIL | {count} negatives |
| Duplicates | PASS/FAIL | {count} duplicates |
| Gaps | PASS/WARN | {count} unexplained |
| Outliers | PASS/WARN | {count} flagged |
| Freshness | PASS/WARN | Last: {date} |

### Overall Verdict
**{PASS / FAIL}**

### Artifacts
- snapshot_id: {uuid}
- checksum: {sha256}
- row_count: {n}

Corporate Actions Ledger Snapshot

markdown
## CA Ledger Snapshot

**Market**: {B3 / US}
**Period**: {start} to {end}
**Snapshot ID**: {uuid}

### Dividends
| Symbol | Ex-Date | Rate | Type |
|--------|---------|------|------|
| {sym} | YYYY-MM-DD | {rate} | CASH/STOCK |

### Splits
| Symbol | Ex-Date | Ratio | Direction |
|--------|---------|-------|-----------|
| {sym} | YYYY-MM-DD | {n}:1 | FORWARD/REVERSE |

### Summary
- Total dividends: {count}
- Total splits: {count}
- Symbols affected: {count}

Universe Snapshot Card

markdown
## Universe Snapshot

**Index**: {IBOV / SPY / etc}
**Reference Date**: YYYY-MM-DD
**Snapshot ID**: {uuid}

### Composition
| Symbol | Weight | Status |
|--------|--------|--------|
| {sym} | {pct}% | ACTIVE |

### Changes Since Last Snapshot
- Added: {list}
- Removed: {list}
- Weight changes: {count}

### PIT Verification
- [ ] No lookahead in composition
- [ ] Delisted symbols marked INACTIVE
- [ ] Weights sum to 100%

Incident Report

markdown
## Data Incident Report

**Incident ID**: {uuid}
**Severity**: SEV-{0/1/2}
**Date Detected**: YYYY-MM-DD
**Date Resolved**: YYYY-MM-DD (or OPEN)

### Summary
{1-2 sentences describing the issue}

### Impact
- Affected strategies: {list or "unknown"}
- Affected period: {start} to {end}
- Affected symbols: {count}

### Root Cause
{description}

### Resolution
{what was done to fix}

### Prevention
{what changes to prevent recurrence}

### Artifacts
- Original data: {path}
- Corrected data: {path}
- Analysis: {path}

Acceptance Criteria

Dataset Release

CriterionPassFail
Snapshot IDPresentMissing
ChecksumComputed and storedMissing
OHLC invariants0 violationsAny violation
Duplicates0Any
GapsAll explainedUnexplained MissingData
CA appliedPer policyInconsistent
Universe PITVerifiedLookahead detected
DocumentationUpdatedStale

Ingestion Pipeline

CriterionPassFail
IdempotentSame result on re-runDifferent results
Error handlingAll errors loggedSilent failures
Rate limitingRespects API limitsQuota exceeded
Schema matchAll fields validType mismatch

Failure Modes

Common Traps

  1. Survivorship from current universe

    • Symptom: Too-good backtest performance
    • Fix: Use point-in-time membership with ref_date
  2. Lookahead in corporate actions

    • Symptom: Impossible dividend capture
    • Fix: Use ex_date, not announcement date
  3. Timezone drift

    • Symptom: Bars at wrong times, overnight gaps wrong
    • Fix: Normalize to market timezone
  4. Silent duplicates

    • Symptom: Inflated volume, double trades
    • Fix: UNIQUE constraint, dedup on ingest
  5. Gaps treated as zeros

    • Symptom: False flat periods, returns look wrong
    • Fix: Distinguish missing from zero, use GapReason
  6. Adjusted mixed with unadjusted

    • Symptom: Double-counting dividends
    • Fix: Enforce PriceType policy strictly
  7. Dividend as false return

    • Symptom: Huge alpha on ex-dividend day
    • Fix: Use raw for valuation, adjusted for signals
  8. Split as false drawdown

    • Symptom: 50% "crash" on split day
    • Fix: Adjusted prices for continuity
  9. Delistings ignored

    • Symptom: Positions in non-existent stocks
    • Fix: Track INACTIVE status, force close on delist
  10. Volume inconsistent

    • Symptom: Capacity analysis wrong
    • Fix: Validate volume units (shares vs lots vs currency)
  11. Provider outliers unfiltered

    • Symptom: Bad ticks in backtest
    • Fix: Validate against CA ledger before filtering
  12. Schema drift without migration

    • Symptom: Queries break, nulls appear
    • Fix: Versioned migrations, schema validation

Red Flags Requiring Investigation

  • Return > 50% on non-CA day
  • Volume = 0 for liquid asset
  • Gap > 10 business days unexplained
  • Universe has no delistings over 5 years
  • All adjusted_close == raw_close (adjustments not applied)

Collaboration Hooks

Handoff to /risk-analyst

When data integrity affects validation:

markdown
## Handoff: data-engineer → risk-analyst

**Issue**: Data integrity concern

**Problem:**
- {description of data issue}

**Affected:**
- Period: {start} to {end}
- Symbols: {list}
- Potential impact: {survivorship/leakage/etc}

**Status:**
- [ ] Data corrected
- [ ] New snapshot created
- [ ] Re-validation required

**Artifacts:**
- Incident report: {path}
- Corrected snapshot_id: {uuid}

Handoff to /trader-expert

For calendar/session verification:

markdown
## Handoff: data-engineer → trader-expert

**Request**: Calendar/session verification

**Question:**
- {specific convention question}

**Context:**
- Market: {B3/US}
- Session type: {regular/auction/after-hours}

**Example:**
- Date: {date}
- Expected: {what we expect}
- Observed: {what we see}

Handoff to /quant-engineer

For query/storage optimization:

markdown
## Handoff: data-engineer → quant-engineer

**Request**: Performance optimization

**Problem:**
- Query: {description}
- Current performance: {time}
- Target: {time}

**Table stats:**
- Rows: {n}
- Size: {mb}
- Indices: {list}

**Priority**: {high/medium/low}

Receiving from /quant-researcher

For new dataset requests:

markdown
## Request: quant-researcher → data-engineer

**Dataset Requested**: {description}

**Specifications:**
- Market: {B3/US/FX}
- Symbols: {list or criteria}
- Period: {start} to {end}
- Interval: {daily/30m/etc}
- Fields needed: {list}

**Use Case:**
- {why this data is needed}

**Timeline:**
- Needed by: {date}
- Priority: {high/medium/low}

Quick Reference

Data Quality Checklist

code
[ ] OHLC invariants: low <= open/close <= high
[ ] Volume >= 0, no negatives
[ ] No duplicate bars
[ ] Timestamps monotonic
[ ] Gaps explained with GapReason
[ ] Corporate actions in ledger
[ ] Adjusted/raw used correctly per policy
[ ] Universe is point-in-time
[ ] Data freshness acceptable
[ ] Snapshot_id documented

Key Tables (Neon)

TablePurpose
instrumentsAsset metadata
universe_membershipPIT index composition
ohlcv_dailyDaily OHLCV data
b3_barsB3 bars (daily/intraday)
dividends_b3 / dividends_usCorporate actions
trading_calendarsHoliday/session data
ingestion_statePipeline state

CLI Commands

bash
# Ingest B3 data
cargo run -p market_data -- ingest-b3 --universe ibov

# Ingest US data
cargo run -p market_data -- ingest-us --symbols AAPL,MSFT

# Verify integrity
cargo run -p market_data -- verify --market us

# Generate status report
cargo run -p market_data -- status --market b3

# Python DataHubs
python -m datahub_b3 full-sync
python -m datahub_us update
python -m datahub_fx sync

Environment Variables

VariablePurpose
DATABASE_URLNeon PostgreSQL connection
NEON_DATABASE_URLAlternative connection string
BRAPI_TOKENBrapi API key (B3)
FRED_API_KEYFRED API key (FX)