BRAIN Data Feature Engineering Workflow

Purpose: Automatically transform BRAIN dataset fields into deep, meaningful feature engineering ideas.

For Detailed Mindset Patterns: See reference.md for feature engineering philosophy. For Implementation Examples: See examples.md for case studies.

Input Requirements

Required Parameters:

•data_category: Dataset category (e.g., "fundamental", "analyst", "news", "model")
•delay: Data delay setting (0 or 1)
•region: Market region (e.g., "USA", "EUR", "ASI")

Optional Parameters:

•universe: Trading universe (default: "TOP3000")
•dataset_id: Specific dataset ID (if known, skips discovery phase)

Workflow Overview

Step 1: Dataset Discovery

Autonomous Action:

•Call mcp__brain-mcp__get_datasets with parameters (category, delay, region, universe)
•If dataset_id provided: Validate and use it
•If dataset_id not provided: Select the most relevant dataset based on metadata analysis
•Output: Locked dataset_id for analysis

Step 2: Field Extraction and Deconstruction

Autonomous Action:

•Call mcp__brain-mcp__get_datafields for the selected dataset
•For each field, extract: id, description, dataType, update frequency, coverage
•
Deconstruct each field's meaning:
- •What is being measured? (the entity/concept)
- •How is it measured? (collection/calculation method)
- •Time dimension? (instantaneous, cumulative, rate of change)
- •Business context? (why does this field exist?)
- •Generation logic? (reliability considerations)
•Build field profiles: Structured understanding of each field's essence

Step 3: Autonomous Thinking and Analysis

The skill performs deep analysis based on collected information:

A. Field Relationship Mapping

•Analyze logical connections between fields
•Identify: independent fields, related fields, complementary fields
•Map the "story" the dataset tells
•Key question: What relationships are implied by these fields?

B. Question-Driven Feature Generation (Internal Process) The skill asks itself these questions and generates feature concepts:

•
"What is stable?" → Look for invariants
- •Which fields or combinations remain relatively constant?
- •What stability measures make sense?
•
"What is changing?" → Analyze change patterns
- •Rate of change, acceleration, volatility
- •Trend vs. noise separation
•
"What is anomalous?" → Identify deviations
- •Outliers, unusual patterns, breaks from normal
- •Deviation magnitude and significance
•
"What is combined?" → Examine interactions
- •How fields interact, amplify, or offset each other
- •Synthesis creates new meaning
•
"What is structural?" → Study compositions
- •Constituent parts, proportional relationships
- •Structural changes over time
•
"What is cumulative?" → Explore accumulation effects
- •Building up over time, decay effects
- •Memory and persistence in data
•
"What is relative?" → Make comparisons
- •Relative positioning, ranking, normalization
- •Context within dataset
•
"What is essential?" → Distill to core meaning
- •First principles thinking
- •Strip away assumptions, get to essence

C. Feature Concept Generation For each relevant question-field combination:

•Formulate feature concept that answers the question
•Define the concept clearly
•Identify the logical meaning
•Consider directionality (what high/low values mean)
•Identify boundary conditions
•Note potential issues/limitations

Step 4: Feature Documentation

For each generated feature concept, document:

•Concept Name: Clear, descriptive name
•Definition: One-sentence definition
•Logical Meaning: What phenomenon/concept does it represent?
•Why It's Meaningful: Why does this feature make sense?
•Directionality: Interpretation of high vs. low values
•Boundary Conditions: What extremes indicate
•Data Requirements: What fields are used and any constraints
•Potential Issues: Known limitations or concerns

Step 5: Output Generation

Generate structured markdown report including:

•
Write the report to ./output_report/region_delay_datasetID_ideas.md in the following format:
•
Dataset Understanding
- •Dataset description and characteristics
- •Field inventory (count, types, update patterns)
- •Key observations about data structure
•
Field Deconstruction Analysis
- •For each field: what it truly measures and why
- •Logical relationships between fields
- •"Story" the data tells
•
Feature Engineering Suggestions by Question Type

3.1 Stability Features
- •Concepts for measuring stability/invariance
- •Why stability matters in this dataset
- •Example implementations
3.2 Change Features
- •Concepts for capturing change patterns
- •Rate, acceleration, volatility measures
- •Temporal dynamics
3.3 Anomaly Features
- •Deviation and outlier detection concepts
- •Normal vs. abnormal identification
- •Significance measures
3.4 Interaction Features
- •Cross-field interaction concepts
- •Amplification, offset, synthesis effects
- •Combined meaning creation
3.5 Structure Features
- •Composition and relationship concepts
- •Proportional analysis
- •Structural change detection
3.6 Cumulative Features
- •Accumulation and decay concepts
- •Memory/persistence measures
- •Time-weighted effects
3.7 Relative Features
- •Comparison and normalization concepts
- •Ranking and percentile measures
- •Context-relative positioning
3.8 Essential Features
- •First-principles derived concepts
- •Core meaning extraction
- •Fundamental measures
•
Implementation Considerations
- •Data quality notes
- •Coverage considerations
- •Computational complexity
- •Potential improvements/extensions
•
Critical Questions for Further Exploration
- •What aspects weren't covered?
- •What additional data would be helpful?
- •What assumptions should be challenged?

Core Analysis Principles

•From Data Essence: Start with what data truly means, not what it's traditionally used for
•Autonomous Reasoning: Skill performs all thinking, no user input required
•Question-Driven: Internal question bank guides feature generation
•Meaning Over Patterns: Prioritize logical meaning over conventional combinations
•Transparency: Show reasoning process in output

Example Output Structure

When analyzing dataset 'BEME' (Balance Sheet and Market Data), the output would include:

Dataset Understanding

Fields Analyzed: book_value, market_cap, book_to_market, etc. Key Observations: Dataset compares accounting values with market valuations

Field Deconstruction

•book_value: Accountant's calculation of net asset value (quarterly, audited, historical cost-based)
•market_cap: Market participants' valuation (continuous, forward-looking, sentiment-influenced)
•book_to_market: Ratio comparing these two valuation perspectives

Feature Concepts Generated

From "What is stable?"

•"Market reevaluation stability": Rolling coefficient of variation of book_to_market
•Logic: Measures whether market opinion is stable or volatile
•Meaning: Stable values suggest consensus, volatile values suggest disagreement/uncertainty

From "What is changing?"

•"Value creation vs. market reevaluation decomposition": Separate book_value growth from market_cap growth
•Logic: Distinguish fundamental value creation from market sentiment changes
•Meaning: Which component drives changes in book_to_market?

From "What is combined?"

•"Intangible value proportion": (market_cap - book_value) / enterprise_value
•Logic: Quantify proportion of value from intangibles (brand, growth, etc.)
•Meaning: What percentage of valuation isn't captured on the balance sheet?

(Additional question-based features would follow...)

Implementation Notes

The skill should:

•Analyze first, then generate: Fully understand dataset before proposing features
•Show reasoning: Explain why each feature concept makes sense
•Be specific: Reference actual field names and their characteristics
•Be critical: Question assumptions and identify limitations
•Be creative: Look beyond traditional financial metrics

The skill should NOT:

•Ask users to think: All thinking is internal to the skill
•Provide generic templates: Each analysis should be specific to the dataset
•Rely on conventional wisdom: Challenge traditional approaches
•Output patterns without meaning: Every suggestion must have clear logic

Quality Assurance

Self-Check Process:

• All fields analyzed, not just skimmed
• Field meanings understood beyond descriptions
• Multiple question types explored
• Each feature has clear logical meaning
• Reasoning is explicit, not implicit
• Limitations are acknowledged
• Output is dataset-specific, not generic

Validation Questions:

•Would this analysis help someone truly understand the data?
•Are feature concepts novel yet meaningful?
•Is the reasoning process transparent?
•Does it avoid conventional thinking traps?

This skill performs autonomous deep analysis of BRAIN datasets, generating meaningful feature engineering concepts based on data essence and logical reasoning.