Scaling and the Road to Human-Level AI

Strategic framework for understanding AI scaling laws and building products that leverage predictable AI capability improvements.

Core Concepts

Two Phases of AI Training

Pretraining: Models learn to predict the next token by imitating human-written text, understanding underlying correlations in data.

Reinforcement Learning (RL): Models are optimized based on human feedback, reinforcing helpful/honest/harmless behaviors and discouraging harmful ones.

Scaling laws exist for both phases—performance improves predictably with increased compute, data, and parameters.

Key Metrics

•Task Horizon: Length/complexity of tasks AI can complete, measured in equivalent human time
•Elo Scores: Rating system measuring model preference comparisons
•Context Window: Amount of information processable in a single conversation

Scaling Law Reliability

Scaling laws have held across 5+ orders of magnitude with physics-level precision. When scaling appears broken, assume training implementation issues first, not fundamental limits.

Strategic Decision Framework

Assess Current AI Capabilities

Use the two-axis capability framework:

•Y-axis (Flexibility): What modalities can the model handle?
•X-axis (Task Horizon): What equivalent human-time tasks can it complete?

Current trajectory: Task horizons double approximately every 7 months.

Product Timing Strategy

code

Current capability assessment:
├── Works reliably now → Build and ship immediately
├── Works 70-80% of time → Viable for error-tolerant use cases
├── Works marginally → Build now, ship when next model releases
└── Doesn't work at all → Wait 1-2 model generations

Key insight: Build products that don't quite work yet with current AI capabilities. Target capabilities slightly beyond current models—future models will make marginal products work.

Use Case Selection Criteria

Prioritize applications where:

•70-80% accuracy is acceptable
•Breadth of knowledge matters more than deep focus on one hard problem
•Cross-domain synthesis creates value (biology + psychology + history)
•Human review can catch and correct errors

Deprioritize applications requiring:

•Near-perfect accuracy on first attempt
•Deep specialized reasoning without verification
•Tasks where errors compound catastrophically

Human-AI Collaboration Model

Role Division

Position humans as managers and sanity-checkers:

•AI generates options and drafts
•Humans verify, select, and course-correct
•AI's judgment-generation gap is smaller than humans'

Leverage AI's Strengths

Breadth over depth: AI excels at synthesizing information across many domains simultaneously. Target applications requiring:

•Literature synthesis across fields
•Pattern recognition across diverse data sources
•Rapid exploration of solution spaces

Practical Workflow

•Define the task scope and success criteria
•Have AI generate initial approach/draft
•Review for sanity and strategic alignment
•Iterate with targeted corrections
•Use AI to refine based on feedback

Forecasting AI Capabilities

Timeline Estimation Method

code

To estimate when a capability becomes viable:

1. Identify current task horizon (what length tasks work reliably)
2. Apply 7-month doubling rule
3. Calculate generations needed:
   - Hour-long tasks → Day-long tasks: ~3 doublings (~21 months)
   - Day-long tasks → Week-long tasks: ~3 doublings (~21 months)
   - Week-long tasks → Month-long tasks: ~4 doublings (~28 months)

Self-Correction Multiplier

Each improvement in a model's ability to notice and correct its own mistakes roughly doubles task horizon length. Factor this into capability forecasts.

Integration Strategy

Avoid the Steam Engine Mistake

Don't just replace existing processes with AI equivalents. Redesign entire systems around AI capabilities (electricity adoption analogy—factories were redesigned around electric motors, not just swapping steam for electric).

Accelerate Adoption

Use AI to integrate AI into products and businesses. The bottleneck is adoption speed, not capability. When facing integration challenges:

•Have AI analyze your current workflow
•Identify substitution points and redesign opportunities
•Prototype with AI assistance
•Iterate rapidly

Jevons Paradox Awareness

Expect that increased AI efficiency leads to increased consumption, not decreased cost. Plan for:

•More AI usage as capabilities improve
•New use cases emerging from better performance
•Expanding scope rather than shrinking budgets

Diagnostic Framework

When Scaling Appears Broken

Before concluding a capability limit exists:

•Verify training/prompting methodology
•Check for data quality issues
•Test with alternative approaches
•Compare against scaling law predictions

Default assumption: Implementation issues, not fundamental limits.

Evaluating Model Improvements

Compare new models against:

•Expected scaling law trajectory
•Task horizon benchmarks
•Cross-domain performance consistency

Deviations from smooth improvement suggest training issues worth investigating.

Example Applications

Product Development Decision

Scenario: Building an AI code review tool

code

Assessment:
- Current models: Reliable for single-file reviews (~minutes)
- Target capability: Full PR reviews with context (~hours)
- Gap: ~2-3 doublings needed

Decision: Build now with single-file scope, architecture for expansion.
Ship current capability, expand automatically as models improve.

Capability Targeting

Scenario: Choosing between deep analysis vs broad synthesis features

code

AI strength analysis:
- Deep focus on one hard problem: Human-competitive, not superior
- Synthesizing across 10 domains: Clear AI advantage

Decision: Prioritize cross-domain synthesis features.
Example: Research assistant that connects findings across biology,
psychology, and economics papers simultaneously.

Timeline Planning

Scenario: When will AI handle week-long research projects reliably?

code

Current state (2024): Hour-long tasks reliable
Doubling rate: ~7 months

Calculation:
- Hour → Day: 3 doublings = 21 months
- Day → Week: 3 doublings = 21 months
- Total: ~42 months (rough estimate)

Planning implication: Build infrastructure now, expect capability 2027-2028.