AgentSkillsCN

ai-scientific-discovery-jumper

基于AlphaFold的研发经验,为科学发现领域的AI系统构建提供指导。当您在为科学领域设计ML系统、规划使用盲测基准的验证策略、决定如何以最大影响力发布科学AI工具、评估数据获取与架构研究的投资比重、打造便于领域专家使用的工具、以计算预算规划ML研究项目,或判断一款科学AI工具是否已跨越面向真实世界应用的相关性门槛时,可使用此技能。

SKILL.md
--- frontmatter
name: ai-scientific-discovery-jumper
description: Guidance for building AI systems for scientific discovery, based on lessons from AlphaFold's development. Use when designing ML systems for scientific domains, planning validation strategies using blind benchmarks, deciding how to release scientific AI tools for maximum impact, evaluating data acquisition vs architectural research investment, building tools for domain expert adoption, structuring ML research projects with compute budgeting, or assessing when a scientific AI tool has crossed the relevance threshold for real-world use.

AI for Scientific Discovery

Principles and workflows for building AI systems that accelerate scientific research, derived from the development of AlphaFold at DeepMind.

Core Principles

Research Ideas as Multipliers

Research and architectural innovation multiply the value of data and compute rather than adding linearly.

Key insight: AlphaFold2 trained on 1% of available data outperformed AlphaFold1 trained on 100% of data. This demonstrates that ideas can provide 100x leverage over data alone.

Implication: When building ML systems for scientific domains, prioritize research iteration over data acquisition. Most compute budget should account for failed research iterations, not final model training.

Many Mid-Scale Ideas

Breakthroughs come from accumulating many medium-sized innovations, not one revolutionary idea.

Key insight: In AlphaFold ablation studies, no single component explained more than 10% of the system's improvement. The breakthrough emerged from combining numerous architectural innovations.

Implication: Avoid searching for a single "magic bullet" solution. Instead, systematically develop and integrate multiple complementary improvements.

Biological Relevance Threshold

Real-world impact requires crossing a threshold where predictions become useful to domain experts who don't care about ML metrics.

Key insight: Experimental biologists adopted AlphaFold not because of benchmark scores, but because predictions matched their private experimental data at a useful accuracy level.

Implication: Define success by domain expert utility, not benchmark performance. Track adoption signals from target users, not just technical metrics.

Workflow: Designing Scientific AI Systems

Step 1: Understand the Domain Problem

  1. Identify the core scientific question (e.g., "How does a linear amino acid chain fold into a 3D structure?")
  2. Map existing experimental methods and their limitations
  3. Determine what accuracy level constitutes "useful" for practitioners
  4. Identify available data sources and their characteristics

Step 2: Evaluate Off-the-Shelf Approaches

  1. Test standard ML architectures on the problem
  2. Document where generic approaches fail or underperform
  3. Identify domain-specific constraints that require custom solutions
  4. Establish baseline performance for comparison

Warning: Off-the-shelf ML approaches are typically insufficient for scientific breakthroughs. Expect to develop domain-specific innovations.

Step 3: Invest in Research Iteration

  1. Allocate majority of compute budget to experimentation, not final training
  2. Design fast iteration cycles to test architectural hypotheses
  3. Track ideas systematically—expect most to fail
  4. Look for compounding improvements from combining innovations

Step 4: Validate with Blind Benchmarks

Use held-out test sets where answers are unknown to prevent overfitting:

code
Validation Strategy:
├── Training data: Known structures (e.g., Protein Data Bank)
├── Validation data: Subset of known structures, held out
└── Blind test: Problems with unpublished answers
    └── Example: CASP competition (biennial, ~100 proteins, answers unreleased)

Critical: True capability can only be measured on problems where answers are unknown to everyone, including system developers.

Step 5: Cross the Relevance Threshold

  1. Identify domain experts willing to test against their private data
  2. Track word-of-mouth adoption signals
  3. Measure whether users validate predictions independently
  4. Monitor for emergent use cases you didn't anticipate

Workflow: Releasing Scientific AI Tools

Accessibility Over Capability

A tool's impact depends more on ease of access than raw capability.

Example: AlphaFold's database of 200 million pre-computed predictions drove adoption far more than open-source code alone.

Release Strategy Checklist

code
□ Code release (minimum viable)
  - Open source the model and training code
  - Provide documentation for technical users

□ Pre-computed results (high impact)
  - Generate predictions for common use cases
  - Host accessible database
  - Enable non-technical users to benefit

□ API access (broader reach)
  - Remove computational barriers
  - Enable integration into existing workflows

□ User interface (maximum accessibility)
  - Build tools for domain experts without ML expertise
  - Focus on their workflow, not your architecture

Anticipate Emergent Capabilities

Power users will discover capabilities you didn't design for.

Example: Protein interaction prediction emerged from AlphaFold 2 days after release—a capability the team hadn't explicitly built or tested.

Action: After release, monitor novel use cases and validate whether they work reliably.

Decision Framework: Data vs Research Investment

When allocating resources between data acquisition and research:

code
Prioritize DATA when:
├── Problem is well-understood
├── Standard architectures perform reasonably
├── Diminishing returns not yet observed
└── Data collection is feasible and affordable

Prioritize RESEARCH when:
├── Off-the-shelf approaches fundamentally fail
├── Domain constraints require custom solutions
├── Data is expensive or limited
└── Existing approaches hit performance ceiling

Default bias: Lean toward research investment. The AlphaFold case demonstrates that architectural innovation can provide orders of magnitude more value than data scaling.

Validation Checklist for Scientific AI

Before claiming a system works for real-world use:

code
□ Blind benchmark validation
  - Test on problems with unknown answers
  - Use established domain competitions if available
  - Avoid overfitting to published test sets

□ Ablation studies
  - Remove each component systematically
  - Measure contribution to overall performance
  - Verify no single idea explains >50% of improvement

□ Domain expert validation
  - Find practitioners with private test data
  - Track whether they adopt the tool
  - Monitor word-of-mouth signals

□ Relevance threshold assessment
  - Define "useful" accuracy for domain experts
  - Measure whether system crosses this threshold
  - Distinguish benchmark improvement from practical utility

Key Concepts Reference

ConceptDefinitionRelevance
Blind benchmarkEvaluation where answers are unknown to all participantsPrevents overfitting, validates true capability
Ablation studySystematic removal of components to measure contributionIdentifies which innovations matter
Relevance thresholdAccuracy level where predictions become useful to practitionersDefines real-world success
EquivarianceNeural network property where outputs transform predictably with inputsImportant for physical/geometric domains
Self-distillationUsing model predictions as additional training dataTechnique for improving with limited labeled data

Anti-Patterns to Avoid

Benchmark Obsession

Optimizing for published benchmarks while missing what domain experts actually need.

Fix: Define success by practitioner adoption, not leaderboard position.

Data Hoarding

Assuming more data will solve fundamental architectural limitations.

Fix: Test whether 10x more data would plausibly help before investing in collection.

Capability Without Accessibility

Building powerful tools that domain experts can't use.

Fix: Release in the most accessible format possible—pre-computed results often beat open-source code.

Single Big Idea Hunting

Searching for one revolutionary insight instead of accumulating improvements.

Fix: Track many mid-scale ideas; expect breakthroughs from combinations.

Closed Validation

Testing only on data where you know the answers.

Fix: Participate in blind benchmarks; seek domain experts with private test data.