AgentSkillsCN

incremental-fetch

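Build resilient data ingestion pipelines from APIs. Use when creating scripts that fetch paginated data from external APIs (Twitter, exchanges, any REST API) and need to track progress, avoid duplicates, handle rate limits, and support both incremental updates and historical backfills. Triggers: "ingest data from API", "pull tweets", "fetch historical data", "sync from X", "build a data pipeline", "fetch without re-downloading", "resume the download", "backfill older data". NOT for: simple one-shot API calls, websocket/streaming connections, file downloads, or APIs without pagination.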

SKILL.md
---
name: incremental-fetch
description: "Build resilient data ingestion pipelines from APIs. Use when creating scripts that fetch paginated data from external APIs (Twitter, exchanges, any REST API) and need to track progress, avoid duplicates, handle rate limits, and support both incremental updates and historical backfills. Triggers: 'ingest data from API', 'pull tweets', 'fetch historical data', 'sync from X', 'build a data pipeline', 'fetch without re-downloading', 'resume the download', 'backfill older data'. NOT for: simple one-shot API calls, websocket/streaming connections, file downloads, or APIs without pagination."
---

Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

| Watermark | Purpose | API Parameter |
| --- | --- | --- |
| newest_id | Fetch new data since last run | since_id |
| oldest_id | Backfill older data | until_id |

A single watermark only fetches forward. Two watermarks enable:

  • Regular runs: fetch NEW data (since newest_id)
  • Backfill runs: fetch OLD data (until oldest_id)
  • No overlap, no gaps
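As a rough illustration of the two cursors, the sketch below keeps both watermarks in one small value object and widens them as IDs are seen. The `Watermarks` class and its `observe` method are hypothetical names, not taken from references/patterns.md, and the sketch assumes numeric, comparable IDs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Watermarks:
    """Range of IDs already stored; None means nothing has been fetched yet."""

    newest_id: Optional[int] = None
    oldest_id: Optional[int] = None

    def observe(self, ids: Iterable[int]) -> "Watermarks":
        """Return a copy widened to cover the given IDs."""
        newest, oldest = self.newest_id, self.oldest_id
        for i in ids:
            newest = i if newest is None else max(newest, i)
            oldest = i if oldest is None else min(oldest, i)
        return Watermarks(newest, oldest)


wm = Watermarks().observe([105, 104, 103])  # first run
wm = wm.observe([108, 107, 106])            # regular run: newer records
wm = wm.observe([102, 101])                 # backfill run: older records
print(wm)  # Watermarks(newest_id=108, oldest_id=101)
```

Regular runs only ever raise `newest_id` and backfill runs only ever lower `oldest_id`, which is why the two modes should not overlap.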

Critical: Data vs Watermark Saving

These are different operations with different timing:

| What | When to Save | Why |
| --- | --- | --- |
| Data records | After EACH page | Resilience: interrupted on page 47? Keep 46 pages |
| Watermarks | ONCE at end of run | Correctness: only commit progress after full success |

```
fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
```
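The timing difference can be sketched with SQLite as follows. This is a minimal sketch under stated assumptions, not the schema from references/patterns.md: the `records` table, the columns of `ingestion_state`, the `run_once` helper, and the `(id, payload)` page shape are all hypothetical. It also assumes inserts are idempotent (`INSERT OR IGNORE`), so pages fetched again after an interrupted run do not create duplicates.

```python
import sqlite3
from typing import Iterable, List, Optional, Tuple

Page = List[Tuple[int, str]]  # (id, payload) rows; this shape is an assumption


def run_once(conn: sqlite3.Connection, source: str, pages: Iterable[Page]) -> None:
    """Store every page as it arrives; commit watermarks only after the last page."""
    newest: Optional[int] = None
    oldest: Optional[int] = None
    for page in pages:
        # Data: committed after EACH page, so an interruption keeps earlier pages.
        conn.executemany(
            "INSERT OR IGNORE INTO records (id, payload) VALUES (?, ?)", page
        )
        conn.commit()
        for rid, _ in page:
            newest = rid if newest is None else max(newest, rid)
            oldest = rid if oldest is None else min(oldest, rid)
    if newest is None or oldest is None:
        return  # nothing fetched, leave watermarks untouched
    # Watermarks: written ONCE, and only reached if no page raised.
    row = conn.execute(
        "SELECT newest_id, oldest_id FROM ingestion_state WHERE source = ?",
        (source,),
    ).fetchone()
    if row is not None:
        newest, oldest = max(newest, row[0]), min(oldest, row[1])
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_state (source, newest_id, oldest_id) "
        "VALUES (?, ?, ?)",
        (source, newest, oldest),
    )
    conn.commit()


def flaky_pages() -> Iterable[Page]:
    """Simulated API: page 1 arrives, page 2 fails."""
    yield [(3, "c"), (2, "b")]
    raise ConnectionError("simulated failure on page 2")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "CREATE TABLE ingestion_state ("
    "source TEXT PRIMARY KEY, newest_id INTEGER, oldest_id INTEGER)"
)

try:
    run_once(conn, "demo", flaky_pages())
except ConnectionError:
    pass

saved_after_crash = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
state_after_crash = conn.execute("SELECT * FROM ingestion_state").fetchall()
print(saved_after_crash, state_after_crash)  # 2 []

# A later run that completes stores the missing page and commits the watermarks.
run_once(conn, "demo", [[(3, "c"), (2, "b")], [(1, "a")]])
print(conn.execute("SELECT * FROM ingestion_state").fetchall())  # [('demo', 3, 1)]
```

After the simulated failure, the two records from page 1 are kept while the state table stays empty, so the next run starts from the last committed watermarks.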

Workflow Decision Tree

```
First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
```
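The decision tree maps directly onto a small function. In this sketch `request_params` is a hypothetical helper name; `since_id` and `until_id` are the parameter names used above, and the API is assumed to treat both as exclusive bounds.

```python
from typing import Dict, Optional


def request_params(
    newest_id: Optional[int],
    oldest_id: Optional[int],
    backfill: bool = False,
) -> Dict[str, int]:
    """Choose the query parameters for this run from the stored watermarks."""
    if newest_id is None or oldest_id is None:
        return {}  # first run: full fetch, no since_id and no until_id
    if backfill:
        return {"until_id": oldest_id}  # backfill mode: only older records
    return {"since_id": newest_id}  # update mode: only newer records


print(request_params(None, None))                # {}
print(request_params(108, 101))                  # {'since_id': 108}
print(request_params(108, 101, backfill=True))   # {'until_id': 101}
```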

Implementation Checklist

  1. Database: Create ingestion_state table (see references/patterns.md)
  2. Fetch loop: Insert records immediately after each API page
  3. Watermark tracking: Track newest/oldest IDs seen in this run
  4. Watermark update: Save watermarks ONCE at end of successful run
  5. Retry: Exponential backoff with jitter
  6. Rate limits: Wait for reset or skip and record for next run
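For the retry step, one common variant is exponential backoff with "full jitter", where each wait is drawn uniformly between zero and an exponentially growing cap. The sketch below is one possible shape, not the version in references/patterns.md: `with_retry`, its defaults, and the choice of retryable exceptions are assumptions, and the `sleep` argument exists only so the demo can run without waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retry(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, sleeping with jittered backoff between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            sleep(random.uniform(0, cap))  # full jitter spreads out retries
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    """Simulated request that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_retry(flaky, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Because the watermarks are only saved at the end of a successful run, an error that escapes `with_retry` leaves the stored progress unchanged.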

Pagination Types

This pattern works best with ID-based pagination (numeric IDs that can be compared). For other pagination types:

| Type | Adaptation |
| --- | --- |
| Cursor/token | Store cursor string instead of ID; can't compare numerically |
| Timestamp | Use last_timestamp column; compare as dates |
| Offset/limit | Store page number; resume from last saved page |
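The adaptations in the table might translate into resume parameters roughly as sketched below. Everything except `since_id` and `last_timestamp` is a placeholder: the `resume_params` helper, the state keys, and the query parameter names `cursor`, `since`, and `page` are hypothetical and depend on the API in question. The offset case re-requests the last saved page, which assumes inserts are idempotent.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def resume_params(kind: str, state: Dict[str, Any]) -> Dict[str, Any]:
    """Build request parameters for resuming, given the pagination type."""
    if kind == "id":
        return {"since_id": state["newest_id"]}
    if kind == "cursor":
        # Opaque token: store and replay it as-is, never compare it.
        return {"cursor": state["cursor"]}
    if kind == "timestamp":
        # Stored as a datetime so it compares as a date, not as a string.
        return {"since": state["last_timestamp"].isoformat()}
    if kind == "offset":
        return {"page": state["page"]}  # resume from the last saved page
    raise ValueError(f"unknown pagination type: {kind}")


print(resume_params("cursor", {"cursor": "abc123"}))  # {'cursor': 'abc123'}
print(
    resume_params(
        "timestamp",
        {"last_timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    )
)  # {'since': '2024-01-01T00:00:00+00:00'}
```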

See references/patterns.md for schemas and code examples.