AgentSkillsCN

brain-datafield-exploration-general

提供六种经过验证的方法,用于在 WorldQuant BRAIN 平台上评估新数据集。方法涵盖覆盖率检查、非零值检测、更新频率分析、取值范围界定、集中趋势分析以及分布特性分析。当用户希望深入了解某一特定数据字段时(例如:“这个字段是什么?”“它多久更新一次?”),可使用此技能。

SKILL.md
--- frontmatter
name: brain-datafield-exploration-general
description: >-
  Provides 6 proven methods to evaluate new datasets on the WorldQuant BRAIN platform.
  Includes methods for checking coverage, non-zero values, update frequency, bounds, central tendency, and distribution.
  Use when the user wants to understand a specific datafield (e.g., "what is this field?", "how often does it update?").

6 Ways to Evaluate a New Dataset

This skill provides 6 methods to quickly evaluate a new datafield on the WorldQuant BRAIN platform. For the complete guide and detailed examples, see reference.md.

Important: Run these simulations with Neutralization: None, Decay: 0, Test Period: P0Y0M. Metrics: Check Long Count and Short Count in the IS Summary.

1. Basic Coverage Analysis

  • Expression: datafield (or vec_op(datafield) for vectors)
  • Insight: % Coverage (Long Count + Short Count) / Universe Size.

2. Non-Zero Value Coverage

  • Expression: datafield != 0 ? 1 : 0
  • Insight: Real coverage (excluding zeros). Distinguishes missing data (NaN) from actual zero values.

3. Data Update Frequency Analysis

  • Expression: ts_std_dev(datafield, N) != 0 ? 1 : 0
  • Insight: Frequency of updates. Vary N:
    • N=5 (Week): Low count implies weekly updates.
    • N=22 (Month): Monthly updates.
    • N=66 (Quarter): Quarterly updates.

4. Data Bounds Analysis

  • Expression: abs(datafield) > X
  • Insight: Check value range. Vary X (e.g., 1, 10, 100) to check scale (e.g., is it normalized -1 to 1?).

5. Central Tendency Analysis

  • Expression: ts_median(datafield, 1000) > X
  • Insight: Typical values over time (5-year median). Vary X to find the center.

6. Data Distribution Analysis

  • Expression: X < scale_down(datafield) && scale_down(datafield) < Y
  • Insight: Distribution shape. scale_down maps to 0-1. Vary X and Y (e.g., 0.1-0.2) to check buckets.

Note on Vector Data

If the datafield is a VECTOR type, wrap it in a vector operator first (e.g., vec_sum(datafield) or vec_mean(datafield)).