Pointline Research Guide
Guide LLM agents through efficient, correct usage of the Pointline data lake for quantitative research.
0. Project Bootstrap (Before Analysis)
When operating in this repository, start with these defaults:
- •Assume working directory is repo root.
- •Use editable install and dev dependencies when Python imports fail.
- •Prefer Query API for exploration and Core API for explicit reproducibility.
# Environment bootstrap uv sync --all-extras source .venv/bin/activate
First response contract for user requests:
- •confirm exchange, symbol, date range, and objective
- •run discovery (
list_exchanges,list_symbols,summarize_symbol) - •run a small probe query before full analysis
- •then run requested analysis with PIT-safe defaults
1. Profile Selection (Context Switching)
Step 1: CLASSIFY the user's request into one of these profiles. Step 2: ADOPT the specific priorities and checks for that profile.
| Profile | Triggers | Key Data | Reference |
|---|---|---|---|
| HFT (Microstructure) | "latency", "L3", "queue", "ticks", "imbalance" | szse_l3_orders, book_snapshot_25 | profile_hft_microstructure.md |
| MFT (Alpha/StatArb) | "rebalance", "funding", "RSI", "portfolio", "4h" | kline_1h, derivative_ticker | profile_mft_alpha.md |
| Execution (TCA) | "slippage", "impact", "fill rate", "vwap" | trades, quotes | profile_execution.md |
Quick Start Workflow
ALWAYS follow this sequence:
- •Discover → Check what data exists (Exchanges, Symbols)
- •Validate → Check
research.data_coverageand then run a small probe query - •Query → Load data with Query API (default)
- •Analyze → Apply analysis patterns with PIT correctness
from pointline import research
# Step 1: Discover
exchanges = research.list_exchanges(asset_class="crypto-derivatives")
symbols = research.list_symbols(exchange="binance-futures", base_asset="BTC")
# Step 2: Validate (The "Sanity Check")
# NEVER skip this. Loading empty data is the root cause of many failures.
coverage = research.data_coverage("binance-futures", "BTCUSDT")
if not coverage["trades"]["available"]:
raise ValueError("Trades table not available for BTCUSDT on binance-futures")
# Step 3: Probe query (fast sanity check on the requested window)
from pointline.research import query
probe = query.trades(
"binance-futures", "BTCUSDT", "2024-05-01", "2024-05-01T00:05:00Z", decoded=True, lazy=False
)
if probe.is_empty():
raise ValueError("No rows found in probe window; check date range, market hours, or data availability.")
# Step 4: Full query
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
# Step 5: Analyze (use ts_local_us for PIT correctness)
import polars as pl
trades = trades.sort("ts_local_us") # Arrival order (PIT correct)
vwap = (trades["price"] * trades["qty"]).sum() / trades["qty"].sum()
Sanity Check Protocols (Self-Correction)
Act as a "Research Partner", not just a tool.
- •
Pre-Flight Check: Before running any heavy computation (VWAP, Backtest), write code to verify:
- •Does the symbol exist? (
list_symbols) - •Is the table/symbol supported? (
data_coverage) - •Are there rows in a short target-window probe query?
- •Is the date range valid? (Start < End)
- •Does the symbol exist? (
- •
Empty Data Handling:
- •IF a query returns 0 rows:
- •DO NOT simply print "Empty DataFrame".
- •DO investigate why: "Data missing for date 2024-05-01. Checked coverage: False. Possible holiday or outage?"
- •
Domain Validation:
- •If a user asks for "Binance Spot Funding Rates", CORRECT THEM: "Binance Spot does not have funding rates. Did you mean Binance Futures?"
- •If a user asks for "SZSE data on Sunday", WARN THEM: "SZSE is closed on weekends. Data will be empty."
Discovery API (Start Here)
CRITICAL: Always check data availability before querying.
List Exchanges
from pointline import research # All exchanges exchanges = research.list_exchanges() # Filter by asset class crypto_spot = research.list_exchanges(asset_class="crypto-spot") crypto_derivatives = research.list_exchanges(asset_class="crypto-derivatives") chinese_stocks = research.list_exchanges(asset_class="stocks-cn")
Asset Classes:
- •
"crypto"- All crypto exchanges - •
"crypto-spot"- Spot only (binance, coinbase, kraken, etc.) - •
"crypto-derivatives"- Derivatives only (binance-futures, deribit, bybit, etc.) - •
"stocks-cn"- Chinese stocks (szse, sse)
List Symbols
# All symbols on exchange symbols = research.list_symbols(exchange="binance-futures") # Filter by base asset btc_symbols = research.list_symbols(exchange="binance-futures", base_asset="BTC") # Fuzzy search eth_symbols = research.list_symbols(search="ETH")
Check Data Coverage
# Check what data exists for a symbol
coverage = research.data_coverage("binance-futures", "BTCUSDT")
print(f"Trades: {coverage['trades']['available']}")
print(f"Quotes: {coverage['quotes']['available']}")
print(f"Book snapshots: {coverage['book_snapshot_25']['available']}")
data_coverage() is a table/symbol availability check, not a guaranteed per-date row-count check.
Always add a short probe query on the target date window before expensive analysis.
Summarize Symbol
# Rich summary with metadata and coverage
research.summarize_symbol("BTCUSDT", exchange="binance-futures")
# Prints: exchange, symbol_id, price_increment, data availability, etc.
API Selection Guide
Default: Query API (90% of Use Cases)
Use Query API for:
- •Exploratory analysis
- •Quick data checks
- •Ad-hoc investigations
- •Most backtests
Benefits:
- •Automatic symbol resolution
- •Automatic decoding (decoded=True)
- •ISO date strings or datetime objects
- •Simple, concise code
from pointline.research import query
# One-liner - automatic everything
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
book = query.book_snapshot_25("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
Advanced: Core API (10% of Use Cases)
Use Core API ONLY when user explicitly needs:
- •Reproducible research with symbol_id tracking
- •Explicit control over symbol resolution
- •Raw integer data (no decoding)
- •Production-grade workflows
Requires:
- •Manual symbol resolution (registry.find_symbol)
- •Microsecond timestamp conversion
- •Manual decoding (if needed)
from pointline import registry, research
from datetime import datetime, timezone
# Step 1: Resolve symbol_id
symbols = registry.find_symbol("BTCUSDT", exchange="binance-futures")
symbol_id = symbols["symbol_id"][0]
# Step 2: Convert timestamps
start = datetime(2024, 5, 1, tzinfo=timezone.utc)
end = datetime(2024, 5, 2, tzinfo=timezone.utc)
start_ts_us = int(start.timestamp() * 1_000_000)
end_ts_us = int(end.timestamp() * 1_000_000)
# Step 3: Load with explicit symbol_id
trades = research.load_trades(symbol_id=symbol_id, start_ts_us=start_ts_us, end_ts_us=end_ts_us)
# Step 4: Decode if needed
from pointline.tables.trades import decode_fixed_point
from pointline.dim_symbol import read_dim_symbol_table
dim_symbol = read_dim_symbol_table()
trades = decode_fixed_point(trades, dim_symbol)
Decision Tree
User asks to load data? │ ├─ Mentions "symbol_id" or "reproducibility"? → Core API ├─ Mentions "production research"? → Ask if they need explicit symbol_id control │ ├─ Yes → Core API │ └─ No → Query API └─ Default → Query API (with decoded=True)
Essential Schemas
trades
Common columns (decoded=True):
| Column | Type | Description |
|---|---|---|
ts_local_us | Int64 | Arrival timestamp (UTC, µs) - USE THIS for replay |
ts_exch_us | Int64 | Exchange timestamp (UTC, µs) |
symbol_id | Int64 | Symbol identifier |
side | UInt8 | 0=buy, 1=sell, 2=unknown |
price | Float64 | Trade price (decoded) |
qty | Float64 | Trade quantity (decoded) |
PIT correctness: Always use ts_local_us for sorting (arrival order).
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
trades = trades.sort("ts_local_us") # ✅ CORRECT - arrival order
quotes
Common columns (decoded=True):
| Column | Type | Description |
|---|---|---|
ts_local_us | Int64 | Arrival timestamp (UTC, µs) |
bid_price | Float64 | Best bid (decoded) |
bid_qty | Float64 | Best bid quantity (decoded) |
ask_price | Float64 | Best ask (decoded) |
ask_qty | Float64 | Best ask quantity (decoded) |
Example: Calculate mid price
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = quotes.with_columns(
((pl.col("ask_price") + pl.col("bid_price")) / 2).alias("mid_price")
)
book_snapshot_25
Common columns (decoded=True):
| Column | Type | Description |
|---|---|---|
ts_local_us | Int64 | Arrival timestamp (UTC, µs) |
bid_price_0 to bid_price_24 | Float64 | Bid prices at depth 0-24 (decoded) |
bid_qty_0 to bid_qty_24 | Float64 | Bid quantities at depth 0-24 (decoded) |
ask_price_0 to ask_price_24 | Float64 | Ask prices at depth 0-24 (decoded) |
ask_qty_0 to ask_qty_24 | Float64 | Ask quantities at depth 0-24 (decoded) |
Level 0 = top of book (best bid/ask)
book = query.book_snapshot_25("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
# Calculate total bid liquidity in top 10 levels
book = book.with_columns(
pl.sum_horizontal([f"bid_qty_{i}" for i in range(10)]).alias("bid_liquidity_top10")
)
Complete schemas: See references/schemas.md for all tables (klines, derivative_ticker, szse_l3_orders, etc.)
Critical Anti-Patterns
❌ DON'T: Use Core API for Simple Queries
# ❌ BAD - Unnecessary complexity
from pointline import registry, research
symbols = registry.find_symbol("BTCUSDT", exchange="binance-futures")
symbol_id = symbols["symbol_id"][0]
start_ts_us = 1714521600000000
trades = research.load_trades(symbol_id=symbol_id, start_ts_us=start_ts_us, end_ts_us=...)
✅ DO: Use Query API
# ✅ GOOD - Simple and correct
from pointline.research import query
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
❌ DON'T: Manually Convert Timestamps
# ❌ BAD - Error-prone from datetime import datetime, timezone start = datetime(2024, 5, 1, tzinfo=timezone.utc) start_ts_us = int(start.timestamp() * 1_000_000) # Easy to mess up
✅ DO: Use ISO Strings
# ✅ GOOD - Query API accepts ISO strings
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
❌ DON'T: Use Exchange Time for Replay
# ❌ BAD - Lookahead bias!
trades = trades.sort("ts_exch_us") # Exchange time
✅ DO: Use Arrival Time
# ✅ GOOD - PIT correct
trades = trades.sort("ts_local_us") # Arrival time
❌ DON'T: Skip Data Discovery
# ❌ BAD - Assumes data exists
trades = query.trades("binance-futures", "UNKNOWN_SYMBOL", "2024-05-01", "2024-05-02")
# Error: Symbol not found!
✅ DO: Check Coverage First
# ✅ GOOD - Verify before loading
coverage = research.data_coverage("binance-futures", "BTCUSDT")
if coverage["trades"]["available"]:
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
Common Workflows
Spread Analysis
from pointline.research import query
import polars as pl
# Load quotes
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
# Calculate spread
quotes = quotes.with_columns([
(pl.col("ask_price") - pl.col("bid_price")).alias("spread"),
((pl.col("ask_price") + pl.col("bid_price")) / 2).alias("mid_price"),
])
# Spread in basis points
quotes = quotes.with_columns(
(pl.col("spread") / pl.col("mid_price") * 10000).alias("spread_bps")
)
print(quotes.select(pl.col("spread_bps").mean()))
VWAP Calculation
from pointline.research import query
import polars as pl
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
# Cumulative VWAP (PIT correct)
trades = trades.with_columns([
((pl.col("price") * pl.col("qty")).cum_sum() / pl.col("qty").cum_sum()).alias("vwap")
])
As-of Join (Trades with Quotes)
from pointline.research import query
trades = query.trades("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
quotes = query.quotes("binance-futures", "BTCUSDT", "2024-05-01", "2024-05-02", decoded=True)
# Match each trade with most recent quote (PIT correct)
trades_with_quotes = trades.join_asof(
quotes.select(["ts_local_us", "bid_price", "ask_price"]),
on="ts_local_us",
strategy="backward", # Use PAST quote only
)
More patterns: See references/analysis_patterns.md for:
- •Order flow metrics (trade imbalance, order book imbalance)
- •Market microstructure (price impact, effective spread)
- •Execution quality (slippage, fill rates)
Reproducibility Principles
Point-in-Time (PIT) Correctness
Default timeline: ts_local_us (arrival time), NOT ts_exch_us (exchange time)
Rationale: Live trading reacts to arrival time. Using exchange time creates lookahead bias.
# ✅ CORRECT - Arrival order
trades = trades.sort("ts_local_us")
# ❌ INCORRECT - Exchange order (lookahead bias)
trades = trades.sort("ts_exch_us")
Deterministic Ordering
Canonical ordering: (ts_local_us, file_id, file_line_number) ascending
Pointline guarantees stable ordering for reproducible outputs.
# ✅ CORRECT - Deterministic
trades = trades.sort("ts_local_us", "file_id", "file_line_number")
# ❌ INCORRECT - Arbitrary tie-breaking
trades = trades.sort("ts_local_us")
Avoiding Lookahead Bias
Use as-of joins (backward strategy):
# ✅ CORRECT - Only past data trades_with_quotes = trades.join_asof(quotes, on="ts_local_us", strategy="backward") # ❌ INCORRECT - May use future data trades_with_quotes = trades.join(quotes, on="ts_local_us", how="left")
Use cumulative calculations:
# ✅ CORRECT - Expanding window
vwap = (pl.col("price") * pl.col("qty")).cum_sum() / pl.col("qty").cum_sum()
# ❌ INCORRECT - Uses future data
vwap = pl.sum("price * qty") / pl.sum("qty") # Whole period!
Complete guide: See references/best_practices.md for:
- •Symbol resolution workflow (SCD Type 2)
- •Fixed-point encoding details
- •Partition pruning optimization
- •Experiment logging standards
Supported Data
26+ exchanges across:
- •Crypto Spot: binance, coinbase, kraken, okx, huobi, gate, bitfinex, bitstamp, gemini, crypto-com, kucoin
- •Crypto Derivatives: binance-futures, deribit, bybit, okx-futures, bitmex, ftx, dydx
- •Chinese Stocks: szse (SZSE), sse (SSE) with Level 3 order book data
Tables:
- •
trades- Trade executions - •
quotes- Best bid/ask (Level 1) - •
book_snapshot_25- Order book depth (25 levels) - •
kline_1h- OHLCV candlesticks (1-hour) - •
derivative_ticker- Funding rates, OI, mark/index prices - •
szse_l3_orders- SZSE order placements/cancellations - •
szse_l3_ticks- SZSE trade executions
Discovery first: Always use research.list_exchanges(), research.list_symbols(), and research.data_coverage() to check availability.
Reference Files
Load these when needed for detailed guidance:
- •references/analysis_patterns.md - Common quant analysis patterns (spreads, volume, order flow, market microstructure, execution quality)
- •references/best_practices.md - Reproducibility principles (PIT correctness, deterministic ordering, symbol resolution, avoiding lookahead bias)
- •references/schemas.md - Comprehensive table schemas with all fields and encoding details
- •references/exchange_quirks.md - "Tribal Knowledge" about exchange-specific behaviors (funding intervals, auction sessions, holiday calendars, etc.)
- •references/ml_features.md - Feature engineering guide for ML models (Order Book Imbalance, Trade Flow Toxicity, Volatility)
- •references/profile_hft_microstructure.md - HFT Context: Latency, queue simulation, packet gaps, L3 data.
- •references/profile_mft_alpha.md - MFT Context: Factor modeling, funding rates, portfolio rebalancing.
- •references/profile_execution.md - TCA Context: Slippage, market impact, execution quality.