AgentSkillsCN

pointline-research

Pointline 是一款高性能、精准至时间点(PIT)的数据湖,专为量化交易研究而打造。在处理高频交易/加密货币/股票市场数据时使用:(1) 探索可用数据(交易所、标的资产、覆盖范围);(2) 加载成交记录、行情报价及订单簿数据;(3) 分析市场微观结构(价差、成交量、订单流);(4) 运行可复现的回测;(5) 验证数据质量。支持加密货币(26+ 个交易所的现货与衍生品)以及中国股票(深交所/上交所三级数据)。重点强调 PIT 的准确性、确定性排序,同时避免前瞻偏差。

SKILL.md
--- frontmatter
name: pointline-research
description: "Guide for using Pointline, a high-performance point-in-time (PIT) accurate data lake for quantitative trading research. Use when working with HFT/crypto/stock market data for: (1) Discovering available data (exchanges, symbols, coverage), (2) Loading trades/quotes/orderbook data, (3) Analyzing market microstructure (spreads, volume, order flow), (4) Running reproducible backtests, (5) Validating data quality. Supports crypto (spot/derivatives on 26+ exchanges) and Chinese stocks (SZSE/SSE Level 3 data). Emphasizes PIT correctness, deterministic ordering, and avoiding lookahead bias."

Pointline Research Guide

Guide LLM agents through efficient, correct usage of the Pointline data lake for quantitative research.

0. Project Bootstrap (Before Analysis)

When operating in this repository, start with these defaults:

  1. Assume working directory is repo root.
  2. Use editable install and dev dependencies when Python imports fail.
  3. Prefer Query API for exploration and Core API for explicit reproducibility.
bash
# Environment bootstrap
uv sync --all-extras
source .venv/bin/activate

First response contract for user requests:

  • confirm exchange, symbol, date range, and objective
  • run discovery (list_exchanges, list_symbols, summarize_symbol)
  • run a small probe query before full analysis
  • then run requested analysis with PIT-safe defaults

1. Profile Selection (Context Switching)

Step 1: CLASSIFY the user's request into one of these profiles. Step 2: ADOPT the specific priorities and checks for that profile.

ProfileTriggersKey DataReference
HFT (Microstructure)"latency", "L3", "queue", "ticks", "imbalance"szse_l3_orders, book_snapshot_25profile_hft_microstructure.md
MFT (Alpha/StatArb)"rebalance", "funding", "RSI", "portfolio", "4h"kline_1h, derivative_tickerprofile_mft_alpha.md
Execution (TCA)"slippage", "impact", "fill rate", "vwap"trades, quotesprofile_execution.md

Quick Start Workflow

ALWAYS follow this sequence:

  1. Discover → Check what data exists (Exchanges, Symbols)
  2. Validate → Check research.data_coverage and then run a small probe query
  3. Query → Load data with Query API (default)
  4. Analyze → Apply analysis patterns with PIT correctness
python
from pointline import research

# Step 1: Discover
exchanges = research.list_exchanges(asset_class="crypto-derivatives")
symbols = research.list_symbols(exchange="binance-futures", base_asset="BTC")

# Step 2: Validate (The "Sanity Check")
# NEVER skip this. Loading empty data is the root cause of many failures.
coverage = research.data_coverage("binance-futures", "BTCUSDT")
if not coverage["trades"]["available"]:
    raise ValueError("Trades table not available for BTCUSDT on binance-futures")

# Step 3: Probe query (fast sanity check on the requested window)
from pointline.research import query
probe = query.trades(
    "binance-futures", "BTCUSDT", "2024-05-01", "2024-05-01T00:05:00Z", decoded=True, lazy=False
)
if probe.is_empty():
    raise ValueError("No rows found in probe window; check date range, market hours, or data availability.")

# Step 4: Full query
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

# Step 5: Analyze (use ts_local_us for PIT correctness)
import polars as pl
trades = trades.sort("ts_local_us")  # Arrival order (PIT correct)
vwap = (trades["price"] * trades["qty"]).sum() / trades["qty"].sum()

Sanity Check Protocols (Self-Correction)

Act as a "Research Partner", not just a tool.

  1. Pre-Flight Check: Before running any heavy computation (VWAP, Backtest), write code to verify:

    • Does the symbol exist? (list_symbols)
    • Is the table/symbol supported? (data_coverage)
    • Are there rows in a short target-window probe query?
    • Is the date range valid? (Start < End)
  2. Empty Data Handling:

    • IF a query returns 0 rows:
    • DO NOT simply print "Empty DataFrame".
    • DO investigate why: "Data missing for date 2024-05-01. Checked coverage: False. Possible holiday or outage?"
  3. Domain Validation:

    • If a user asks for "Binance Spot Funding Rates", CORRECT THEM: "Binance Spot does not have funding rates. Did you mean Binance Futures?"
    • If a user asks for "SZSE data on Sunday", WARN THEM: "SZSE is closed on weekends. Data will be empty."

Discovery API (Start Here)

CRITICAL: Always check data availability before querying.

List Exchanges

python
from pointline import research

# All exchanges
exchanges = research.list_exchanges()

# Filter by asset class
crypto_spot = research.list_exchanges(asset_class="crypto-spot")
crypto_derivatives = research.list_exchanges(asset_class="crypto-derivatives")
chinese_stocks = research.list_exchanges(asset_class="stocks-cn")

Asset Classes:

  • "crypto" - All crypto exchanges
  • "crypto-spot" - Spot only (binance, coinbase, kraken, etc.)
  • "crypto-derivatives" - Derivatives only (binance-futures, deribit, bybit, etc.)
  • "stocks-cn" - Chinese stocks (szse, sse)

List Symbols

python
# All symbols on exchange
symbols = research.list_symbols(exchange="binance-futures")

# Filter by base asset
btc_symbols = research.list_symbols(exchange="binance-futures", base_asset="BTC")

# Fuzzy search
eth_symbols = research.list_symbols(search="ETH")

Check Data Coverage

python
# Check what data exists for a symbol
coverage = research.data_coverage("binance-futures", "BTCUSDT")
print(f"Trades: {coverage['trades']['available']}")
print(f"Quotes: {coverage['quotes']['available']}")
print(f"Book snapshots: {coverage['book_snapshot_25']['available']}")

data_coverage() is a table/symbol availability check, not a guaranteed per-date row-count check. Always add a short probe query on the target date window before expensive analysis.

Summarize Symbol

python
# Rich summary with metadata and coverage
research.summarize_symbol("BTCUSDT", exchange="binance-futures")
# Prints: exchange, symbol_id, price_increment, data availability, etc.

API Selection Guide

Default: Query API (90% of Use Cases)

Use Query API for:

  • Exploratory analysis
  • Quick data checks
  • Ad-hoc investigations
  • Most backtests

Benefits:

  • Automatic symbol resolution
  • Automatic decoding (decoded=True)
  • ISO date strings or datetime objects
  • Simple, concise code
python
from pointline.research import query

# One-liner - automatic everything
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
book = query.book_snapshot_25("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

Advanced: Core API (10% of Use Cases)

Use Core API ONLY when user explicitly needs:

  • Reproducible research with symbol_id tracking
  • Explicit control over symbol resolution
  • Raw integer data (no decoding)
  • Production-grade workflows

Requires:

  • Manual symbol resolution (registry.find_symbol)
  • Microsecond timestamp conversion
  • Manual decoding (if needed)
python
from pointline import registry, research
from datetime import datetime, timezone

# Step 1: Resolve symbol_id
symbols = registry.find_symbol("BTCUSDT", exchange="binance-futures")
symbol_id = symbols["symbol_id"][0]

# Step 2: Convert timestamps
start = datetime(2024, 5, 1, tzinfo=timezone.utc)
end = datetime(2024, 5, 2, tzinfo=timezone.utc)
start_ts_us = int(start.timestamp() * 1_000_000)
end_ts_us = int(end.timestamp() * 1_000_000)

# Step 3: Load with explicit symbol_id
trades = research.load_trades(symbol_id=symbol_id, start_ts_us=start_ts_us, end_ts_us=end_ts_us)

# Step 4: Decode if needed
from pointline.tables.trades import decode_fixed_point
from pointline.dim_symbol import read_dim_symbol_table
dim_symbol = read_dim_symbol_table()
trades = decode_fixed_point(trades, dim_symbol)

Decision Tree

code
User asks to load data?
│
├─ Mentions "symbol_id" or "reproducibility"? → Core API
├─ Mentions "production research"? → Ask if they need explicit symbol_id control
│   ├─ Yes → Core API
│   └─ No → Query API
└─ Default → Query API (with decoded=True)

Essential Schemas

trades

Common columns (decoded=True):

ColumnTypeDescription
ts_local_usInt64Arrival timestamp (UTC, µs) - USE THIS for replay
ts_exch_usInt64Exchange timestamp (UTC, µs)
symbol_idInt64Symbol identifier
sideUInt80=buy, 1=sell, 2=unknown
priceFloat64Trade price (decoded)
qtyFloat64Trade quantity (decoded)

PIT correctness: Always use ts_local_us for sorting (arrival order).

python
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
trades = trades.sort("ts_local_us")  # ✅ CORRECT - arrival order

quotes

Common columns (decoded=True):

ColumnTypeDescription
ts_local_usInt64Arrival timestamp (UTC, µs)
bid_priceFloat64Best bid (decoded)
bid_qtyFloat64Best bid quantity (decoded)
ask_priceFloat64Best ask (decoded)
ask_qtyFloat64Best ask quantity (decoded)

Example: Calculate mid price

python
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = quotes.with_columns(
    ((pl.col("ask_price") + pl.col("bid_price")) / 2).alias("mid_price")
)

book_snapshot_25

Common columns (decoded=True):

ColumnTypeDescription
ts_local_usInt64Arrival timestamp (UTC, µs)
bid_price_0 to bid_price_24Float64Bid prices at depth 0-24 (decoded)
bid_qty_0 to bid_qty_24Float64Bid quantities at depth 0-24 (decoded)
ask_price_0 to ask_price_24Float64Ask prices at depth 0-24 (decoded)
ask_qty_0 to ask_qty_24Float64Ask quantities at depth 0-24 (decoded)

Level 0 = top of book (best bid/ask)

python
book = query.book_snapshot_25("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

# Calculate total bid liquidity in top 10 levels
book = book.with_columns(
    pl.sum_horizontal([f"bid_qty_{i}" for i in range(10)]).alias("bid_liquidity_top10")
)

Complete schemas: See references/schemas.md for all tables (klines, derivative_ticker, szse_l3_orders, etc.)


Critical Anti-Patterns

❌ DON'T: Use Core API for Simple Queries

python
# ❌ BAD - Unnecessary complexity
from pointline import registry, research
symbols = registry.find_symbol("BTCUSDT", exchange="binance-futures")
symbol_id = symbols["symbol_id"][0]
start_ts_us = 1714521600000000
trades = research.load_trades(symbol_id=symbol_id, start_ts_us=start_ts_us, end_ts_us=...)

✅ DO: Use Query API

python
# ✅ GOOD - Simple and correct
from pointline.research import query
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

❌ DON'T: Manually Convert Timestamps

python
# ❌ BAD - Error-prone
from datetime import datetime, timezone
start = datetime(2024, 5, 1, tzinfo=timezone.utc)
start_ts_us = int(start.timestamp() * 1_000_000)  # Easy to mess up

✅ DO: Use ISO Strings

python
# ✅ GOOD - Query API accepts ISO strings
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

❌ DON'T: Use Exchange Time for Replay

python
# ❌ BAD - Lookahead bias!
trades = trades.sort("ts_exch_us")  # Exchange time

✅ DO: Use Arrival Time

python
# ✅ GOOD - PIT correct
trades = trades.sort("ts_local_us")  # Arrival time

❌ DON'T: Skip Data Discovery

python
# ❌ BAD - Assumes data exists
trades = query.trades("binance-futures", "UNKNOWN_SYMBOL", "2024-05-01", "2024-05-02")
# Error: Symbol not found!

✅ DO: Check Coverage First

python
# ✅ GOOD - Verify before loading
coverage = research.data_coverage("binance-futures", "BTCUSDT")
if coverage["trades"]["available"]:
    trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

Common Workflows

Spread Analysis

python
from pointline.research import query
import polars as pl

# Load quotes
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

# Calculate spread
quotes = quotes.with_columns([
    (pl.col("ask_price") - pl.col("bid_price")).alias("spread"),
    ((pl.col("ask_price") + pl.col("bid_price")) / 2).alias("mid_price"),
])

# Spread in basis points
quotes = quotes.with_columns(
    (pl.col("spread") / pl.col("mid_price") * 10000).alias("spread_bps")
)

print(quotes.select(pl.col("spread_bps").mean()))

VWAP Calculation

python
from pointline.research import query
import polars as pl

trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

# Cumulative VWAP (PIT correct)
trades = trades.with_columns([
    ((pl.col("price") * pl.col("qty")).cum_sum() / pl.col("qty").cum_sum()).alias("vwap")
])

As-of Join (Trades with Quotes)

python
from pointline.research import query

trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)

# Match each trade with most recent quote (PIT correct)
trades_with_quotes = trades.join_asof(
    quotes.select(["ts_local_us", "bid_price", "ask_price"]),
    on="ts_local_us",
    strategy="backward",  # Use PAST quote only
)

More patterns: See references/analysis_patterns.md for:

  • Order flow metrics (trade imbalance, order book imbalance)
  • Market microstructure (price impact, effective spread)
  • Execution quality (slippage, fill rates)

Reproducibility Principles

Point-in-Time (PIT) Correctness

Default timeline: ts_local_us (arrival time), NOT ts_exch_us (exchange time)

Rationale: Live trading reacts to arrival time. Using exchange time creates lookahead bias.

python
# ✅ CORRECT - Arrival order
trades = trades.sort("ts_local_us")

# ❌ INCORRECT - Exchange order (lookahead bias)
trades = trades.sort("ts_exch_us")

Deterministic Ordering

Canonical ordering: (ts_local_us, file_id, file_line_number) ascending

Pointline guarantees stable ordering for reproducible outputs.

python
# ✅ CORRECT - Deterministic
trades = trades.sort("ts_local_us", "file_id", "file_line_number")

# ❌ INCORRECT - Arbitrary tie-breaking
trades = trades.sort("ts_local_us")

Avoiding Lookahead Bias

Use as-of joins (backward strategy):

python
# ✅ CORRECT - Only past data
trades_with_quotes = trades.join_asof(quotes, on="ts_local_us", strategy="backward")

# ❌ INCORRECT - May use future data
trades_with_quotes = trades.join(quotes, on="ts_local_us", how="left")

Use cumulative calculations:

python
# ✅ CORRECT - Expanding window
vwap = (pl.col("price") * pl.col("qty")).cum_sum() / pl.col("qty").cum_sum()

# ❌ INCORRECT - Uses future data
vwap = pl.sum("price * qty") / pl.sum("qty")  # Whole period!

Complete guide: See references/best_practices.md for:

  • Symbol resolution workflow (SCD Type 2)
  • Fixed-point encoding details
  • Partition pruning optimization
  • Experiment logging standards

Supported Data

26+ exchanges across:

  • Crypto Spot: binance, coinbase, kraken, okx, huobi, gate, bitfinex, bitstamp, gemini, crypto-com, kucoin
  • Crypto Derivatives: binance-futures, deribit, bybit, okx-futures, bitmex, ftx, dydx
  • Chinese Stocks: szse (SZSE), sse (SSE) with Level 3 order book data

Tables:

  • trades - Trade executions
  • quotes - Best bid/ask (Level 1)
  • book_snapshot_25 - Order book depth (25 levels)
  • kline_1h - OHLCV candlesticks (1-hour)
  • derivative_ticker - Funding rates, OI, mark/index prices
  • szse_l3_orders - SZSE order placements/cancellations
  • szse_l3_ticks - SZSE trade executions

Discovery first: Always use research.list_exchanges(), research.list_symbols(), and research.data_coverage() to check availability.


Reference Files

Load these when needed for detailed guidance: