AgentSkillsCN

persistent-cache-gap-filling

优化多重免疫荧光成像中的分割特征提取、SOM 聚类分析以及空间数据分析,充分挖掘细胞区室与细胞外基质(ECM)的特征信息。

SKILL.md
--- frontmatter
name: persistent-cache-gap-filling
description: "Persistent data cache with gap-filling for historical market data. Trigger when: (1) cache re-downloads complete data unnecessarily, (2) time-based cache expiry wastes API calls, (3) historical data needs incremental updates only, (4) gap-fill warnings appear when cache is sufficient."
author: Claude Code
date: 2026-02-01

Persistent Cache with Gap-Filling (v3.7.0)

Experiment Overview

ItemDetails
Date2026-01-01
GoalEliminate redundant downloads of historical data by removing time-based cache expiry
Environmentalpaca_trading/data/ modules
StatusSuccess

Context

User noticed that re-running the training notebook caused complete re-downloads of historical data even though:

  • Data was downloaded earlier the same day
  • Historical data is immutable (past candles never change)
  • Only new bars since the last download were needed

The root cause was time-based cache expiry:

  • SQLite cache (cache.py): 12-hour TTL via PERSISTED_TTL_HOURS
  • Pickle cache (caching_fetcher.py): 3-7 day expiry via cache_expiry_days

v2.8.0 Solution: Persistent Cache + Gap-Filling

Core Principle

Historical market data is immutable. Once downloaded and validated, it should persist indefinitely. Only fetch new bars to fill the gap between cache end and current time.

Changes Made

1. cache.py - SQLite Cache

python
# Before: TTL always checked
def get(self, ..., ttl_hours: int = 24):
    if created_at < ttl_cutoff or expires_at < now_ts:
        self._remove_entry(cache_key)
        return None

# After: TTL is optional (None = no expiry)
def get(self, ..., ttl_hours: Optional[int] = None):
    # Only check TTL if explicitly specified
    if ttl_hours is not None:
        # ... TTL check
    # Otherwise return cached data regardless of age

2. fetcher.py - DataFetcher

python
# Removed
PERSISTED_TTL_HOURS = 12
self._cache_ttl_hours = PERSISTED_TTL_HOURS

# Updated _load_persisted - no TTL check
def _load_persisted(self, symbol: str, timeframe: str) -> pd.DataFrame:
    # No TTL - historical data is immutable
    cached = self._cache.get(symbol, timeframe, start="", end="")
    return cached

3. caching_fetcher.py - CachingDataFetcher

python
class CachingDataFetcher:
    def get_bars(self, symbol, timeframe, lookback_days, **kwargs):
        cached_df = load_from_cache(symbol, timeframe, cache_dir=self._cache_dir)

        if cached_df is not None:
            cache_end = cached_df.index.max()

            # Check if cache covers requested range
            if cache_start <= start_dt and cache_end >= end_dt - tolerance:
                return cached_df  # Complete - no API call

            # Gap-fill: only fetch new bars
            fetch_start = cache_end + timedelta(hours=1)
            new_df = self._fetcher.get_bars(symbol, start=fetch_start, ...)

            # Merge and save
            combined = pd.concat([cached_df, new_df])
            save_to_cache(symbol, combined, ...)
            return combined

Behavior Comparison

Before (Time-Based Expiry)

code
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (12h TTL)
Run 2 (10:30 AM): Cache valid -> [CACHE] instant
Run 3 (11:00 PM): Cache expired -> [API] Fetch 4 years AGAIN

After (Persistent + Gap-Fill)

code
Run 1 (10:00 AM): Fetch 4 years of data [API] -> Cache (persistent)
Run 2 (10:30 AM): Cache complete -> [CACHE] instant
Run 3 (11:00 PM): Cache + gap-fill -> [GAP-FILL] Fetch 13 new bars only

Output Messages

MessageMeaning
[CACHE] AAPL: 35,040 bars (complete)Cache covers full range, no API call
[GAP-FILL] AAPL: Fetching 2026-01-01 to 2026-01-01...Fetching only new bars
[UPDATED] AAPL: 35,038 + 2 = 35,040 barsMerged new bars with cache
[API] AAPL: Fetching 1460 days...No cache, full download

Cache Statistics

New gap_fills counter added:

python
stats = fetcher.get_cache_stats()
# {
#   'cache_hits': 8,      # Returned cached data unchanged
#   'cache_misses': 2,    # No cache, full download
#   'gap_fills': 5,       # Merged new bars with cache
#   'hit_rate': 0.87      # (hits + gap_fills) / total
# }

v3.7.0 Update: gap_fill_threshold_days

Problem: Gap-fill attempts were logging alarming 401 warnings even when cache had sufficient data for training. A 3-day-old cache with 4,318 bars is perfectly fine for training, but the 2-hour tolerance triggered unnecessary API calls.

Solution: Added gap_fill_threshold_days parameter (default: 7 days)

python
# CachingDataFetcher now accepts gap_fill_threshold_days
fetcher = CachingDataFetcher(
    cache_dir='/path/to/cache',
    gap_fill_threshold_days=7,  # Only gap-fill if cache > 7 days old
)

Behavior:

  • Cache < 7 days old: Return cached data immediately, no API call
  • Cache >= 7 days old: Attempt gap-fill (may show warnings if API fails)
  • Cache with sufficient bars: Always returned regardless of age

Output Messages (v3.7.0):

MessageMeaning
[CACHE] AAPL: 4,318 bars (3d old)Cache is recent enough, no gap-fill attempted
[CACHE] AAPL: 4,318 bars (complete)Cache fully covers requested range
[GAP-FILL] AAPL: Fetching...Cache is stale, attempting to fetch new bars

Failed Attempts

ApproachResultWhy It Failed
Increase TTL to 30 daysWorked but fragileStill expires eventually, arbitrary cutoff
Check file modification timePartialDoesn't verify data completeness
2-hour tolerance for gap-fillUnnecessary warningsCache 3 days old is fine for training

Key Insights

  1. Historical data is immutable - Past candles never change, so there's no reason to re-fetch them
  2. Only the edge needs updating - New bars appear at the end of the series
  3. Time-based expiry is wrong model - For mutable data (news, weather) TTL makes sense; for historical OHLCV it's waste
  4. Completeness > freshness - Check if cache covers the requested date range, not how old the file is

Files Modified

code
alpaca_trading/data/cache.py:
  - get(): ttl_hours now Optional[int] = None (no expiry by default)

alpaca_trading/data/fetcher.py:
  - Removed PERSISTED_TTL_HOURS constant
  - _load_persisted(): No TTL check
  - _save_persisted(): Uses 10-year TTL (effectively infinite)

alpaca_trading/data/caching_fetcher.py:
  - DEFAULT_*_CACHE_EXPIRY_DAYS = None (no expiry)
  - is_cache_valid(): Just checks file exists
  - get_bars(): Gap-filling logic added
  - get_cache_stats(): Added gap_fills counter

Backward Compatibility

  • Existing .pkl cache files work unchanged
  • cache_expiry_days parameter still accepted but ignored
  • Old caches are automatically upgraded (no migration needed)

References

  • Skill: selection-data-caching - Original caching implementation (v2.5.1)
  • Skill: data-source-priority - Data fetching hierarchy
  • alpaca_trading/data/caching_fetcher.py: Gap-filling implementation