AgentSkillsCN

scientific-validation

科学方法用于验证声明,包括预先注册、功效分析、统计严谨性及贝叶斯方法。适用于检验假设、开展实验或验证论文中的声明时。触发条件:验证、假设、实验、回测、证据、统计检验。不触发条件:日常编码、配置变更、文档编写、非实验任务。

SKILL.md
--- frontmatter
name: scientific-validation
description: "Scientific method for validating claims with pre-registration, power analysis, statistical rigor, and Bayesian methods. Use when testing hypotheses, running experiments, or validating claims from papers. TRIGGER when: validate, hypothesis, experiment, backtest, evidence, statistical test. DO NOT TRIGGER when: routine coding, config changes, documentation, non-experimental tasks."
allowed-tools: [Read, Grep, Glob, Bash, Write]

Scientific Validation Skill

Rigorous methodology for validating claims from any source - books, papers, theories, or intuition.

When This Skill Activates

  • Testing claims from books, papers, or expert sources
  • Validating rules, strategies, or hypotheses
  • Running experiments or backtests
  • Keywords: "validate", "test hypothesis", "experiment", "backtest", "prove", "evidence"

Core Principle

Data is the arbiter. Sources can be wrong.

  • Expert books can be wrong
  • Only empirical validation decides what works
  • Document negative results - they're valuable

Phase Overview

PhaseNameKey Requirement
0Claim VerificationUnderstand what source ACTUALLY claims
1Claims ExtractionDocument with source citations
1.5Publication Bias PreventionDocument ALL claims before selecting
2Pre-RegistrationHypothesis BEFORE seeing results
2.3Power AnalysisCalculate required n (MANDATORY)
3Bias PreventionLook-ahead, survivorship, selection
3.5Walk-ForwardRequired for time series (MANDATORY)
4Statistical Requirementsp-values, effect sizes, corrections
4.7Bayesian ComplementBayes Factors for ambiguous results
5Multi-Source ValidationTest across 3+ contexts
5.3Sensitivity Analysis±20% parameter stability (MANDATORY)
5.5Adversarial ReviewInvoke experiment-critic agent
6ClassificationVALIDATED / REJECTED / INSUFFICIENT
7DocumentationComplete audit trail
7.3Negative ResultsStructured failure documentation

See: workflow.md for detailed step-by-step instructions per phase.


Quick Reference

Claim Types

TypeTestable?Example
PERFORMANCEYES"A beats B on metric X"
METHODOLOGICALYES"A enables capability X"
PHILOSOPHICALMAYBE"X is important because Y"
BEHAVIORALHARD"Humans do X in situation Y"

Sample Size Requirements (80% Power)

Effect SizeCohen's dRequired n
Small0.2394
Medium0.564
Large0.826

See: code-examples.md#power-analysis for calculation code.

Classification Criteria

StatusCriteria
VALIDATEDOOS meets all criteria + critic PROCEED
CONDITIONALOOS meets relaxed criteria (p < 0.10)
REJECTEDOOS fails OR negative effect
INSUFFICIENTn < 15 in OOS
UNTESTABLERequired data unavailable
INVALIDCircular validation detected

Domain Effect Thresholds (Trading)

MetricMinimumStrongExceptional
Sharpe Ratio> 0.5> 1.0> 2.0
Win Rate> 55%> 60%> 70%
Profit Factor> 1.2> 1.5> 2.0

See: code-examples.md#effect-thresholds for other domains.

Bayes Factor Interpretation

BFEvidence
< 1Supports null
1-3Anecdotal
3-10Moderate
10-30Strong
> 30Very strong

Critical Rules

1. Pre-Registration

  • Document hypothesis BEFORE seeing any results
  • Define success criteria BEFORE testing
  • No peeking at test data

2. Power Analysis (Phase 2.3)

python
from statsmodels.stats.power import TTestIndPower
n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)

Rule: Underpowered studies cannot achieve VALIDATED status.

3. Walk-Forward for Time Series (Phase 3.5)

  • Standard K-fold CV → INVALID (temporal leakage)
  • Single train/test → CONDITIONAL at best
  • Walk-forward → Can achieve VALIDATED

See: code-examples.md#walk-forward for implementation.

4. Multiple Comparison Correction

python
alpha_corrected = 0.05 / num_claims  # Bonferroni

For trading claims: require t-ratio > 3.0 (Harvey et al. standard).

5. Sensitivity Analysis (Phase 5.3)

Test ±20% parameter variation:

  • All variations positive → Can achieve VALIDATED
  • 1-2 sign flips → CONDITIONAL at best
  • 3+ sign flips → REJECTED (fragile)

See: code-examples.md#sensitivity-analysis for implementation.

6. Adversarial Review (Phase 5.5)

code
Use Task tool:
  subagent_type: "experiment-critic"
  prompt: "Review experiment EXP-XXX"

MANDATORY before any classification.


Bias Prevention Checklist

BiasPrevention
Look-aheadProcess data sequentially, compare batch vs streaming
SurvivorshipTrack ALL attempts, not just completions
SelectionReport ALL experiments including failures
Data snoopingStrict train/test split, no tuning on test data
PublicationDocument ALL claims before selecting which to test

Pre-Experiment Checklist

  • Claim extracted with source citation
  • ALL claims documented (not just tested ones)
  • Hypothesis documented BEFORE results
  • Power analysis: required n calculated
  • Success criteria defined
  • Walk-forward configured (time series)
  • Costs/constraints specified

Post-Experiment Checklist

  • Sample size adequate per power analysis
  • p-value AND effect size reported
  • Bayesian analysis if ambiguous
  • Sensitivity analysis passed
  • Adversarial review completed
  • Negative results documented if REJECTED

Red Flags

  • 100% success rate → Possible bias
  • OOS better than training → Possible leakage
  • Result flips with ±20% params → Fragile
  • Only tested "interesting" claims → Selection bias

Key Principles

  1. Hypothesis BEFORE data - No peeking
  2. Power analysis BEFORE experiment - Know required n
  3. Walk-forward for time series - Preserve temporal order
  4. Sensitivity analysis - Results must survive ±20% changes
  5. Adversarial self-critique - Challenge your methodology
  6. Document negative results - Failures are valuable
  7. Sources can be wrong - Even experts, even textbooks

Detailed Documentation

TopicFile
Step-by-step workflowworkflow.md
Python code examplescode-examples.md
Markdown templatestemplates.md
Adversarial review../../agents/experiment-critic.md

Hard Rules

FORBIDDEN:

  • Reporting results without confidence intervals or statistical significance
  • Cherry-picking favorable metrics while ignoring unfavorable ones
  • Claiming causation from correlation without controlled experiments

REQUIRED:

  • All experiments MUST have a documented hypothesis before execution
  • All results MUST include sample size, variance, and statistical test used
  • Negative results MUST be reported with the same rigor as positive results
  • Baselines MUST be established and compared against for every metric