ITS Setup
Set up inference-time scaling (ITS) algorithms to improve reasoning quality without retraining the model.
Core Concepts
Inference-time scaling trades compute at inference for better outputs:
- Generate multiple candidates
- Use reward models to select the best
- Or use structured search to find optimal solutions
Available Algorithms
Best-of-N
Generate N samples and return the highest-scoring one.
python
from its_hub.algorithms import BestOfN
from its_hub.lms import OpenAICompatibleLanguageModel
from reward_hub import AutoRM
# Load language model
lm = OpenAICompatibleLanguageModel(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
)
# Load reward model
rm = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")
# Create algorithm
alg = BestOfN(orm=rm)
# Run inference
result = alg.infer(lm, prompt="Solve: 2x + 3 = 7", budget=8)
print(result.answer)
print(result.score)
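For intuition, the selection step can be sketched without its_hub at all. In this minimal sketch, `generate` and `score` are hypothetical stand-ins for the language model and the reward model; they are not library calls, and the library's internals may differ.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical sampler: prompt -> candidate
    score: Callable[[str, str], float],  # hypothetical reward: (prompt, candidate) -> score
    budget: int,
) -> Tuple[str, float]:
    """Draw `budget` candidates and return the highest-scoring one with its score."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(budget)]
    scores = [score(prompt, c) for c in candidates]
    best = max(range(budget), key=scores.__getitem__)  # first index with the top score
    return candidates[best], scores[best]


# Toy stand-ins so the sketch runs without a model server.
_canned = iter(["x = 1", "x = 2", "x = 5", "x = 2"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "x = 2" else 0.1


answer, top_score = best_of_n("Solve: 2x + 3 = 7", toy_generate, toy_score, budget=4)
print(answer, top_score)
```

The cost is linear in `budget`: every candidate is generated and scored once, so doubling the budget doubles both LM and reward-model calls.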
Self-Consistency
Generate multiple solutions and pick the final answer by majority vote.
python
from its_hub.algorithms import SelfConsistency
alg = SelfConsistency(
    answer_extractor="boxed",  # Extract answer from \boxed{...}
    aggregation="majority",  # "majority" or "weighted"
)
result = alg.infer(lm, prompt="What is 15% of 80?", budget=16)
print(result.answer)
print(result.consistency_score) # Fraction agreeing with answer
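The voting step can be illustrated in plain Python. This is a minimal sketch under stated assumptions: `extract_boxed` and `majority_vote` are hypothetical helpers written for this example, and the regex only handles `\boxed{...}` contents without nested braces.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Matches \boxed{...} where the content has no nested braces (an assumption of this sketch).
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, or None if absent."""
    matches = _BOXED.findall(text)
    return matches[-1].strip() if matches else None


def majority_vote(responses: List[str]) -> Tuple[Optional[str], float]:
    """Return the most common extracted answer and the fraction of responses agreeing with it."""
    answers = [a for a in map(extract_boxed, responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(responses)


samples = [
    "15% of 80 is 0.15 * 80 = \\boxed{12}",
    "80 * 15 / 100 = \\boxed{12}",
    "I get \\boxed{1.2}",
    "So the answer is \\boxed{12}",
]
answer, consistency = majority_vote(samples)
print(answer, consistency)
```

Responses with no extractable answer still count in the denominator here, so a low score can mean either disagreement or extraction failures.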
Beam Search
Search the solution space step by step, keeping only the highest-scoring partial solutions at each depth.
python
from its_hub.algorithms import BeamSearch
alg = BeamSearch(
    prm=prm,  # Process Reward Model for step scoring (see Reward Model Integration)
    beam_width=4,
    max_depth=10,
)
result = alg.infer(lm, prompt="Prove that sqrt(2) is irrational")
print(result.answer)
print(result.path) # Steps taken to reach answer
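A toy version of step-level beam search helps show what `beam_width` and `max_depth` control. In this sketch, `expand`, `score`, and `is_done` are hypothetical callables standing in for step generation, the process reward model, and a completion check; none of them are its_hub APIs.

```python
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def beam_search(
    expand: Callable[[Path], Sequence[str]],  # hypothetical: candidate next steps for a path
    score: Callable[[Path], float],           # hypothetical PRM: score of a partial path
    is_done: Callable[[Path], bool],          # hypothetical: True once a path is a full solution
    beam_width: int,
    max_depth: int,
) -> Path:
    """Keep the `beam_width` best partial paths per depth and return the best path found."""
    beams: List[Path] = [()]
    for _ in range(max_depth):
        candidates: List[Path] = []
        for path in beams:
            if is_done(path):
                candidates.append(path)  # finished paths are carried forward unchanged
            else:
                candidates.extend(path + (step,) for step in expand(path))
        if not candidates:
            break
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
        if all(is_done(p) for p in beams):
            break
    return max(beams, key=score)


# Toy problem: pick increments that sum exactly to TARGET.
TARGET = 6


def total(path: Path) -> int:
    return sum(int(step) for step in path)


def toy_expand(path: Path) -> Sequence[str]:
    return ["+1", "+2", "+3"]


def toy_score(path: Path) -> float:
    return -abs(TARGET - total(path))


def toy_is_done(path: Path) -> bool:
    return total(path) >= TARGET


best_path = beam_search(toy_expand, toy_score, toy_is_done, beam_width=2, max_depth=5)
print(best_path)
```

A wider beam explores more alternatives per depth at proportionally higher cost; a beam width of 1 reduces to greedy step selection.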
Particle Filtering
Maintain a weighted set of particles (partial solutions) over the solution space, resampling toward the higher-scoring ones.
python
from its_hub.algorithms import ParticleFilter
alg = ParticleFilter(
    prm=prm,
    num_particles=16,
    resample_threshold=0.5,
)
result = alg.infer(lm, prompt="Complex multi-step problem...")
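The propose, weight, and resample loop can be sketched as follows. This is an illustration only: it assumes `resample_threshold` is a fraction of the effective sample size (ESS), which the text above does not confirm, and `propose` and `score` are hypothetical stand-ins for step generation and a process reward model that returns positive scores.

```python
import random
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def effective_sample_size(weights: Sequence[float]) -> float:
    """ESS = (sum w)^2 / sum(w^2); equals len(weights) when all weights are equal."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def particle_filter(
    propose: Callable[[Path, random.Random], str],  # hypothetical: sample one next step
    score: Callable[[Path], float],                 # hypothetical PRM: must return a positive score
    num_particles: int,
    num_steps: int,
    resample_threshold: float,  # assumed here to be a fraction of num_particles
    rng: random.Random,
) -> Path:
    """Extend all particles step by step, resampling when the weights degenerate."""
    particles: List[Path] = [() for _ in range(num_particles)]
    weights = [1.0] * num_particles
    for _ in range(num_steps):
        particles = [p + (propose(p, rng),) for p in particles]
        weights = [w * score(p) for w, p in zip(weights, particles)]
        if effective_sample_size(weights) < resample_threshold * num_particles:
            # Resample in proportion to weight, then reset to uniform weights.
            particles = rng.choices(particles, weights=weights, k=num_particles)
            weights = [1.0] * num_particles
    best = max(range(num_particles), key=weights.__getitem__)
    return particles[best]


def toy_propose(path: Path, rng: random.Random) -> str:
    return rng.choice(["good", "bad"])


def toy_score(path: Path) -> float:
    return 0.9 if path[-1] == "good" else 0.1


result = particle_filter(
    toy_propose, toy_score,
    num_particles=16, num_steps=4, resample_threshold=0.5,
    rng=random.Random(0),
)
print(result)
```

Unlike beam search, resampling is stochastic, so low-scoring particles can survive a round and the population keeps some diversity.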
Language Model Configuration
OpenAI-compatible endpoints
python
from its_hub.lms import OpenAICompatibleLanguageModel
lm = OpenAICompatibleLanguageModel(
    model="gpt-4",
    base_url="https://api.openai.com/v1",
    api_key="sk-...",
    temperature=0.7,
    max_tokens=2048,
)
vLLM local server
python
lm = OpenAICompatibleLanguageModel(
    model="meta-llama/Llama-3.1-8B-Instruct",
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)
HuggingFace models (direct)
python
from its_hub.lms import HuggingFaceLanguageModel
lm = HuggingFaceLanguageModel(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    torch_dtype="bfloat16",
)
Reward Model Integration
Using reward_hub models
python
from reward_hub import AutoRM
from its_hub.integration.reward_hub import RewardHubORM, RewardHubPRM
# Outcome Reward Model (scores full solutions)
orm_model = AutoRM.load("internlm/internlm2-7b-reward", load_method="vllm")
orm = RewardHubORM(orm_model)
# Process Reward Model (scores steps)
prm_model = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")
prm = RewardHubPRM(prm_model)
Custom reward functions
python
from its_hub.rewards import BaseORM
class CustomORM(BaseORM):
    def score(self, prompt: str, response: str) -> float:
        # Your scoring logic
        if "error" in response.lower():
            return 0.0
        return len(response) / 1000  # Simple length-based score
Budget Configuration
python
# Fixed budget
result = alg.infer(lm, prompt, budget=8)
# Dynamic budget (assumes BudgetConfig has already been imported)
result = alg.infer(
    lm,
    prompt,
    budget=BudgetConfig(
        min_samples=4,
        max_samples=32,
        early_stop_threshold=0.95,  # Stop if confidence > 95%
    ),
)
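One way to read the dynamic budget is as an early-stopping loop over samples. The sketch below assumes "confidence" means the share of samples agreeing with the leading answer; that interpretation, the function name `sample_with_early_stop`, and the `draw` callable are all hypothetical and not taken from the library.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def sample_with_early_stop(
    draw: Callable[[], str],  # hypothetical: returns one extracted answer per call
    min_samples: int,
    max_samples: int,
    early_stop_threshold: float,
) -> Tuple[str, int]:
    """Return (leading answer, number of samples used)."""
    if not 1 <= min_samples <= max_samples:
        raise ValueError("need 1 <= min_samples <= max_samples")
    counts: Counter = Counter()
    for used in range(1, max_samples + 1):
        counts[draw()] += 1
        answer, top = counts.most_common(1)[0]
        # Stop once enough samples are in and the leader's share exceeds the threshold.
        if used >= min_samples and top / used > early_stop_threshold:
            break
    return answer, used


# Unanimous answers stop at the minimum; split answers run to the maximum.
unanimous = sample_with_early_stop(lambda: "2", 4, 32, 0.95)
split = sample_with_early_stop(cycle(["a", "b"]).__next__, 2, 6, 0.95)
print(unanimous, split)
```

Easy prompts then cost close to `min_samples`, while contested ones use the full `max_samples`.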
Batched Inference
python
# Process multiple prompts efficiently
prompts = ["Solve: x + 1 = 2", "Solve: 2x = 6", "Solve: x^2 = 4"]
results = alg.infer_batch(
    lm,
    prompts,
    budget=8,
    batch_size=32,  # LLM batch size
)
for prompt, result in zip(prompts, results):
    print(f"{prompt} -> {result.answer}")
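How `infer_batch` groups requests internally is not stated here. As a rough mental model only, the sketch below assumes each prompt yields `budget` generation requests that are then chunked into groups of at most `batch_size`; `batched_requests` is a hypothetical helper, not a library function.

```python
from typing import Iterator, List, Sequence, Tuple


def batched_requests(
    prompts: Sequence[str], budget: int, batch_size: int
) -> Iterator[List[Tuple[int, str]]]:
    """Yield lists of (prompt_index, prompt) pairs, at most `batch_size` per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # One request per (prompt, sample) pair, keeping the prompt index for regrouping.
    requests = [(i, p) for i, p in enumerate(prompts) for _ in range(budget)]
    for start in range(0, len(requests), batch_size):
        yield requests[start:start + batch_size]


prompts = ["Solve: x + 1 = 2", "Solve: 2x = 6", "Solve: x^2 = 4"]
batch_sizes = [len(b) for b in batched_requests(prompts, budget=8, batch_size=10)]
print(batch_sizes)
```

Under this model, 3 prompts with a budget of 8 produce 24 requests, which fit in a single batch when `batch_size=32`.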
Temperature Strategies
python
# Different temperatures for exploration
alg = BestOfN(
    orm=orm,
    temperature_schedule=[0.7, 0.8, 0.9, 1.0],  # Vary per sample
)
# Or use fixed temperature
alg = BestOfN(
    orm=orm,
    temperature=0.8,
)
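What happens when the budget exceeds the schedule length is not specified above. The sketch below shows one plausible policy, cycling through the schedule; `temperatures_for` is a hypothetical helper and the cycling behavior is an assumption, not documented library behavior.

```python
from itertools import cycle, islice
from typing import List, Sequence


def temperatures_for(schedule: Sequence[float], budget: int) -> List[float]:
    """Assign one temperature per sample, repeating the schedule as needed."""
    if not schedule:
        raise ValueError("schedule must not be empty")
    return list(islice(cycle(schedule), budget))


temps = temperatures_for([0.7, 0.8, 0.9, 1.0], budget=6)
print(temps)
```

Mixing low and high temperatures gives some conservative samples and some exploratory ones within the same budget.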
Output Handling
python
result = alg.infer(lm, prompt, budget=8)
# Access results
print(result.answer)  # Best answer
print(result.score)  # Score of best answer
print(result.all_candidates)  # All generated candidates
print(result.all_scores)  # Scores for all candidates
print(result.metadata)  # Algorithm-specific metadata
Common Configurations
Math problem solving
python
from its_hub.algorithms import BestOfN
from reward_hub import AutoRM
rm = AutoRM.load("Qwen/Qwen2.5-Math-PRM-7B", load_method="vllm")
alg = BestOfN(orm=rm)
result = alg.infer(lm, "Solve step by step: ...", budget=16)
Code generation
python
alg = SelfConsistency(
    answer_extractor="code_block",  # Extract ```code```
    aggregation="exact_match",
)
result = alg.infer(lm, "Write a function to...", budget=8)
Complex reasoning
python
alg = BeamSearch(
    prm=prm,
    beam_width=8,
    max_depth=15,
    step_delimiter="\n\n",  # Split on double newlines
)
result = alg.infer(lm, "Prove that...", budget=64)
Related Skills
- /reward-configure - Configure reward models
- /pipeline-design - Design end-to-end pipelines