ITS Setup

Set up inference-time scaling (ITS) algorithms to improve reasoning quality.

Core Concepts

Inference-time scaling spends extra compute at inference in exchange for better outputs:

• Generate multiple candidates
• Use reward models to select the best candidate
• Or use structured search to find higher-scoring solutions

Available Algorithms

Best-of-N

Generate N samples and return the highest-scoring one.

python

from its_hub.algorithms import BestOfN
from its_hub.lms import OpenAICompatibleLanguageModel
from reward_hub import AutoRM

# Load language model
lm = OpenAICompatibleLanguageModel(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
)

# Load reward model
rm = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")

# Create algorithm
alg = BestOfN(orm=rm)

# Run inference
result = alg.infer(lm, prompt="Solve: 2x + 3 = 7", budget=8)
print(result.answer)
print(result.score)
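
To make the selection step concrete without any library, here is a minimal sketch. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical stand-ins and not part of its_hub; a real setup would call the language model and the reward model in their place.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompt: str,
    n: int,
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Sample n candidates; return the best one, its score, and all (candidate, score) pairs."""
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = []
    for _ in range(n):
        candidate = generate(prompt)
        scored.append((candidate, score(prompt, candidate)))
    best, best_score = max(scored, key=lambda pair: pair[1])
    return best, best_score, scored


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Pretend the model proposes a value for x.
    return f"x = {rng.choice([1, 2, 3])}"


def toy_score(prompt: str, response: str) -> float:
    # Reward only the response that satisfies 2x + 3 = 7.
    x = int(response.split("=")[1])
    return 1.0 if 2 * x + 3 == 7 else 0.0


answer, top_score, candidates = best_of_n(
    toy_generate, toy_score, "Solve: 2x + 3 = 7", n=8
)
print(answer, top_score)
```

A larger `n` raises the chance that at least one candidate is correct, but the final answer is only as good as the scorer's ability to recognize it.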

Self-Consistency

Generate multiple solutions and pick the final answer by majority vote.

python

from its_hub.algorithms import SelfConsistency

alg = SelfConsistency(
    answer_extractor="boxed",  # Extract answer from \boxed{...}
    aggregation="majority",     # e.g. "majority" or "weighted"
)

result = alg.infer(lm, prompt="What is 15% of 80?", budget=16)
print(result.answer)
print(result.consistency_score)  # Fraction agreeing with answer
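
The voting step can be sketched in plain Python as follows. `extract_boxed` and `majority_vote` are hypothetical helpers written for this illustration, not its_hub functions, and the regex assumes the boxed answer contains no nested braces.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Matches \boxed{...} with no nested braces inside (a simplifying assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    matches = BOXED.findall(text)
    return matches[-1].strip() if matches else None


def majority_vote(responses: List[str]) -> Tuple[Optional[str], float]:
    """Return the most common extracted answer and the fraction of responses agreeing."""
    answers = [a for a in (extract_boxed(r) for r in responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(responses)


responses = [
    "15% of 80 is 0.15 * 80 = \\boxed{12}",
    "80 * 15 / 100 = \\boxed{12}",
    "I think it is \\boxed{8}",
    "Roughly twelve.",  # no boxed answer, so it counts against consistency
]
answer, consistency = majority_vote(responses)
print(answer, consistency)  # 12 0.5
```

Dividing by the total number of responses, rather than only those with an extractable answer, is a design choice here: unparseable samples lower the consistency score.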

Beam Search

Search the solution space step by step, keeping only the highest-scoring partial solutions (the beam) at each depth.

python

from its_hub.algorithms import BeamSearch

alg = BeamSearch(
    prm=prm,  # Process Reward Model for step scoring (created under "Reward Model Integration")
    beam_width=4,
    max_depth=10,
)

result = alg.infer(lm, prompt="Prove that sqrt(2) is irrational")
print(result.answer)
print(result.path)  # Steps taken to reach answer
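
As an illustration of how `beam_width` and `max_depth` interact, here is a minimal, library-free sketch. The function `beam_search` and the toy `expand`, `score`, and `is_done` callables are assumptions made for this example; in practice the language model would propose the steps and a process reward model would score them.

```python
from typing import Callable, List, Tuple


def beam_search(
    expand: Callable[[List[int]], List[int]],
    score: Callable[[List[int]], float],
    is_done: Callable[[List[int]], bool],
    beam_width: int,
    max_depth: int,
) -> Tuple[List[int], float]:
    """Keep the beam_width best partial paths at each depth, up to max_depth steps."""
    beam: List[List[int]] = [[]]
    for _ in range(max_depth):
        candidates = []
        for path in beam:
            if is_done(path):
                candidates.append(path)  # carry finished paths forward unchanged
                continue
            for step in expand(path):
                candidates.append(path + [step])
        # Prune: keep only the top-scoring paths.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        if all(is_done(p) for p in beam):
            break
    best = max(beam, key=score)
    return best, score(best)


# Toy problem: reach a total of 10 using steps of 1, 3, or 5, in few steps.
TARGET = 10
best_path, best_score = beam_search(
    expand=lambda path: [1, 3, 5],
    score=lambda path: -abs(TARGET - sum(path)) - 0.01 * len(path),
    is_done=lambda path: sum(path) >= TARGET,
    beam_width=2,
    max_depth=5,
)
print(best_path)  # [5, 5]
```

A wider beam explores more alternatives per depth at proportionally higher cost; a narrow beam can prune the path that would have led to the best final answer.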

Particle Filtering

Maintain a weighted set of particles (partial solutions) over the solution space, resampling them according to step-level rewards.

python

from its_hub.algorithms import ParticleFilter

alg = ParticleFilter(
    prm=prm,  # Process Reward Model (created under "Reward Model Integration")
    num_particles=16,
    resample_threshold=0.5,
)

result = alg.infer(lm, prompt="Complex multi-step problem...")
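
The resampling step can be sketched as below. This assumes `resample_threshold` is compared against the effective sample size (ESS) as a fraction of the particle count, which is a common convention but is not confirmed by this document; `effective_sample_size` and `maybe_resample` are hypothetical helpers.

```python
import random
from typing import List, Sequence, Tuple


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = 1 / sum(w_i^2) for normalized weights; ranges from 1 to N."""
    w = normalize(weights)
    return 1.0 / sum(x * x for x in w)


def maybe_resample(
    particles: Sequence[str],
    weights: Sequence[float],
    threshold: float,
    rng: random.Random,
) -> Tuple[List[str], List[float]]:
    """Resample with replacement when ESS / N drops below threshold."""
    n = len(particles)
    if effective_sample_size(weights) / n >= threshold:
        return list(particles), normalize(weights)
    drawn = rng.choices(list(particles), weights=list(weights), k=n)
    return drawn, [1.0 / n] * n  # weights reset to uniform after resampling


rng = random.Random(0)
print(effective_sample_size([1, 1, 1, 1]))  # 4.0

# One particle dominates, so ESS / N is low and resampling is triggered.
particles, weights = maybe_resample(
    ["a", "b", "c", "d"], [0.97, 0.01, 0.01, 0.01], threshold=0.5, rng=rng
)
print(particles, weights)
```

Resampling concentrates compute on promising partial solutions, while sampling in proportion to weight (instead of keeping only the top ones, as beam search does) preserves some diversity.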

Language Model Configuration

OpenAI-compatible endpoints

python

from its_hub.lms import OpenAICompatibleLanguageModel

lm = OpenAICompatibleLanguageModel(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    temperature=0.7,
    max_tokens=2048,
)

vLLM local server

python

lm = OpenAICompatibleLanguageModel(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

HuggingFace models (direct)

python

from its_hub.lms import HuggingFaceLanguageModel

lm = HuggingFaceLanguageModel(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    torch_dtype="bfloat16",
)

Reward Model Integration

Using reward_hub models

python

from reward_hub import AutoRM
from its_hub.integration.reward_hub import RewardHubORM, RewardHubPRM

# Outcome Reward Model (scores full solutions)
orm_model = AutoRM.load("internlm/internlm2-7b-reward", load_method="vllm")
orm = RewardHubORM(orm_model)

# Process Reward Model (scores steps)
prm_model = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")
prm = RewardHubPRM(prm_model)
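
The difference between the two kinds of model shows up when a whole solution has to be ranked: an ORM returns one score, while a PRM returns one score per step that must be reduced to a single number. The sketch below shows three common reductions; `aggregate_step_scores` and its method names are hypothetical and not taken from its_hub or reward_hub.

```python
import math
from typing import Sequence


def aggregate_step_scores(step_scores: Sequence[float], method: str = "min") -> float:
    """Reduce per-step scores to one solution-level score."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    if method == "min":
        return min(step_scores)  # a solution is as good as its weakest step
    if method == "prod":
        return math.prod(step_scores)  # treats scores as independent probabilities
    if method == "last":
        return step_scores[-1]  # trust only the score of the final step
    raise ValueError(f"unknown method: {method}")


step_scores = [0.9, 0.5, 0.8]
print(aggregate_step_scores(step_scores, "min"))   # 0.5
print(aggregate_step_scores(step_scores, "last"))  # 0.8
```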

Custom reward functions

python

from its_hub.rewards import BaseORM

class CustomORM(BaseORM):
    def score(self, prompt: str, response: str) -> float:
        # Your scoring logic
        if "error" in response.lower():
            return 0.0
        return len(response) / 1000  # Simple length-based score

Budget Configuration

python

# Fixed budget
result = alg.infer(lm, prompt, budget=8)

# Dynamic budget (BudgetConfig must be imported first; its import path is not given here)
result = alg.infer(
    lm,
    prompt,
    budget=BudgetConfig(
        min_samples=4,
        max_samples=32,
        early_stop_threshold=0.95,  # Stop if confidence > 95%
    ),
)
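
One way to read the dynamic budget is as an early-stopping loop: draw at least `min_samples`, stop once the leading answer is confident enough, and never exceed `max_samples`. The sketch below uses the agreement fraction as the confidence measure, which is an assumption; `sample_until_confident` is a hypothetical helper, not the library's implementation.

```python
from collections import Counter
from typing import Callable, Tuple


def sample_until_confident(
    sample_answer: Callable[[], str],
    min_samples: int,
    max_samples: int,
    early_stop_threshold: float,
) -> Tuple[str, float, int]:
    """Draw answers until the leading answer's share exceeds the threshold.

    Returns (leading answer, its share of the samples, samples used).
    """
    if not 1 <= min_samples <= max_samples:
        raise ValueError("require 1 <= min_samples <= max_samples")
    counts: Counter = Counter()
    for drawn in range(1, max_samples + 1):
        counts[sample_answer()] += 1
        answer, votes = counts.most_common(1)[0]
        if drawn >= min_samples and votes / drawn > early_stop_threshold:
            break
    return answer, votes / drawn, drawn


# Toy sampler that always agrees, so the loop stops at min_samples.
answers = iter(["4", "4", "4", "4", "4"])
answer, agreement, used = sample_until_confident(
    lambda: next(answers), min_samples=4, max_samples=32, early_stop_threshold=0.95
)
print(answer, agreement, used)  # 4 1.0 4
```

Easy prompts then finish near `min_samples`, while prompts with disagreeing samples consume the full `max_samples`.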

Batched Inference

python

# Process multiple prompts efficiently
prompts = ["Solve: x + 1 = 2", "Solve: 2x = 6", "Solve: x^2 = 4"]

results = alg.infer_batch(
    lm,
    prompts,
    budget=8,
    batch_size=32,  # LLM batch size
)

for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result.answer}")
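
To see how `budget` and `batch_size` relate, the sketch below expands each prompt into `budget` generation requests and groups them into batches. How `infer_batch` schedules requests internally is not described here, so this is only one plausible arrangement; `expand_requests` and `chunked` are hypothetical helpers.

```python
from typing import Iterator, List, Sequence, Tuple


def expand_requests(prompts: Sequence[str], budget: int) -> List[Tuple[int, str]]:
    """Repeat each prompt budget times, tagged with its prompt index."""
    return [(i, p) for i, p in enumerate(prompts) for _ in range(budget)]


def chunked(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


prompts = ["Solve: x + 1 = 2", "Solve: 2x = 6", "Solve: x^2 = 4"]
requests = expand_requests(prompts, budget=8)
batches = list(chunked(requests, batch_size=10))
print(len(requests), [len(b) for b in batches])  # 24 [10, 10, 4]
```

Keeping the prompt index with each request lets the generated samples be regrouped per prompt before scoring.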

Temperature Strategies

python

# Vary the temperature across samples for more exploration
alg = BestOfN(
    orm=orm,
    temperature_schedule=[0.7, 0.8, 0.9, 1.0],  # Vary per sample
)

# Or use fixed temperature
alg = BestOfN(
    orm=orm,
    temperature=0.8,
)
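
When the budget exceeds the length of the schedule, the temperatures have to be reused somehow. The sketch below cycles through the schedule, which is an assumption about the intended behavior and not a documented rule; `temperatures_for` is a hypothetical helper.

```python
from itertools import cycle, islice
from typing import List, Sequence


def temperatures_for(schedule: Sequence[float], budget: int) -> List[float]:
    """Assign one temperature per sample by cycling through the schedule."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return list(islice(cycle(schedule), budget))


print(temperatures_for([0.7, 0.8, 0.9, 1.0], 6))
# [0.7, 0.8, 0.9, 1.0, 0.7, 0.8]
```

Lower temperatures give conservative samples and higher ones give diverse samples, so a mixed schedule hedges between the two within one budget.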

Output Handling

python

result = alg.infer(lm, prompt, budget=8)

# Access results
print(result.answer)           # Best answer
print(result.score)            # Score of best answer
print(result.all_candidates)   # All generated candidates
print(result.all_scores)       # Scores for all candidates
print(result.metadata)         # Algorithm-specific metadata

Common Configurations

Math problem solving

python

from its_hub.algorithms import BestOfN
from reward_hub import AutoRM

rm = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")
alg = BestOfN(orm=rm)

result = alg.infer(lm, "Solve step by step: ...", budget=16)

Code generation

python

from its_hub.algorithms import SelfConsistency

alg = SelfConsistency(
    answer_extractor="code_block",  # Extract ```code```
    aggregation="exact_match",
)

result = alg.infer(lm, "Write a function to...", budget=8)

Complex reasoning

python

from its_hub.algorithms import BeamSearch

alg = BeamSearch(
    prm=prm,
    beam_width=8,
    max_depth=15,
    step_delimiter="\n\n",  # Split on double newlines
)

result = alg.infer(lm, "Prove that...", budget=64)

Related Skills

• /reward-configure - Configure reward models
• /pipeline-design - Design end-to-end pipelines