Offline vs Online Evaluation
Definitions
Offline Evaluation
Test on static dataset before deployment
code
Dataset:     Ground truth test set (fixed)
When:        During development
Metrics:     Accuracy, F1, BLEU, RAG metrics
Environment: Development/staging
Online Evaluation
Measure in production with real users
code
Dataset:     Live traffic (dynamic)
When:        In production
Metrics:     User satisfaction, task success, engagement
Environment: Production
Why Both Matter
Offline: Fast Iteration, Controlled Testing
code
Pros:
- Fast (minutes to hours)
- Reproducible (same dataset)
- Safe (no user impact)
- Cheap (no production traffic)

Use for:
- Rapid development
- Comparing models
- Regression testing
Online: Real-World Performance, Actual User Impact
code
Pros:
- Real performance (actual users)
- Actual impact (business metrics)
- Catches issues offline can't (latency, UX)

Use for:
- Final validation
- Measuring business impact
- Continuous monitoring
Need Both for Complete Picture
code
Offline: "Model A is 5% more accurate" Online: "Model A increases user satisfaction by 10%" Both needed to make informed decisions
Offline Evaluation
When
During development, before deployment
Dataset
Ground truth test set (500-5000 examples)
Metrics
- Accuracy: % correct
- F1 Score: Harmonic mean of precision and recall
- BLEU/ROUGE: Generation quality
- RAG Metrics: Faithfulness, relevance
Pros
- Fast: Evaluate 1000 examples in minutes
- Reproducible: Same dataset → same results
- Safe: No user impact
- Cheap: No production costs
Cons
- May not reflect real performance: Test set ≠ production
- Missing context: No user behavior, latency, UX
- Dataset drift: Production changes over time
Process
code
1. Create evaluation dataset (ground truth)
2. Run model on dataset
3. Compute metrics
4. Compare to baseline
5. Iterate until metrics improve
6. Deploy to production
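A minimal sketch of this loop, using exact-match accuracy and a toy ground-truth set (the dataset and stand-in models below are illustrative only):
python
def evaluate_offline(predict, dataset):
    """Run a model over a fixed test set and return exact-match accuracy."""
    correct = sum(
        1 for ex in dataset
        if predict(ex["question"]).strip().lower() == ex["answer"].strip().lower()
    )
    return correct / len(dataset)

# Toy ground-truth set and two stand-in "models" for demonstration
test_set = [
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Capital of Japan?", "answer": "Tokyo"},
]
baseline = lambda q: "Paris"  # current model: always answers "Paris"
candidate = lambda q: {"Capital of France?": "Paris",
                       "Capital of Japan?": "Tokyo"}.get(q, "")  # new model

baseline_acc = evaluate_offline(baseline, test_set)
candidate_acc = evaluate_offline(candidate, test_set)
print(f"baseline={baseline_acc:.0%}, candidate={candidate_acc:.0%}")
if candidate_acc > baseline_acc:
    print("Candidate passes the offline gate; proceed toward deployment")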
Online Evaluation
When
In production with real users
Dataset
Live traffic (actual user queries)
Metrics
- User satisfaction: Thumbs up/down, ratings
- Task success: Did the user achieve their goal?
- Engagement: Click-through rate, time on page
- Efficiency: Time to complete, steps needed
- Safety: Violations, flags, escalations
Pros
- Real performance: Actual users, actual queries
- Actual impact: Business metrics (revenue, retention)
- Catches real issues: Latency, UX, edge cases
Cons
- Slower: Needs traffic and statistical significance
- Risky: A bad model affects real users
- Requires traffic: Can't test without users
Methods
- A/B testing
- Shadow mode
- Canary deployment
- Interleaving
Online Evaluation Methods
A/B Testing (Two Variants)
Setup:
code
Control (A):   Current model (50% of users)
Treatment (B): New model (50% of users)

Randomly assign users
Measure metrics for both
Run a statistical significance test
Ship the winner
Example:
python
def assign_variant(user_id):
    # Note: the built-in hash() is randomized per process; use a stable hash
    # (e.g., hashlib) in production so users keep the same variant across restarts
    if hash(user_id) % 2 == 0:
        return "A"  # Control
    else:
        return "B"  # Treatment

# Serve model based on variant
variant = assign_variant(user_id)
if variant == "A":
    answer = model_a.predict(question)
else:
    answer = model_b.predict(question)

# Log result for later analysis
log_result(user_id, variant, question, answer, user_feedback)
Statistical Significance:
python
from scipy.stats import ttest_ind
from statistics import mean

satisfaction_a = [4, 5, 3, 4, 5]  # User ratings for A (truncated)
satisfaction_b = [5, 5, 4, 5, 4]  # User ratings for B (truncated)

t_stat, p_value = ttest_ind(satisfaction_a, satisfaction_b)

if p_value < 0.05:
    print("Statistically significant difference!")
    if mean(satisfaction_b) > mean(satisfaction_a):
        print("Ship variant B")
else:
    print("No significant difference")
Shadow Mode (Log but Don't Serve)
Setup:
code
Production: Serve model A (current)
Shadow:     Run model B in the background (don't serve)
Compare:    Model B predictions vs model A predictions
No user impact: Users only see model A
Example:
python
import asyncio

# Serve current model to the user
answer_a = model_a.predict(question)
serve_to_user(answer_a)

# Run new model in shadow mode (async; result is logged, never served)
async def shadow_predict():
    answer_b = model_b.predict(question)
    log_shadow_result(question, answer_a, answer_b)
    # Compare the two models' outputs
    if answer_a != answer_b:
        log_difference(question, answer_a, answer_b)

# Assumes a running event loop (e.g., inside an async request handler)
asyncio.create_task(shadow_predict())
Benefits:
- No user impact (safe)
- Real production traffic
- Can compare models directly
Canary Deployment (Small % of Traffic)
Setup:
code
1. Deploy new model to 1% of traffic
2. Monitor closely (errors, latency, satisfaction)
3. If good, increase to 5%
4. Gradually increase to 100%
5. Roll back if issues appear
Example:
python
def get_model(user_id):
    # Canary: 5% of users get the new model
    # (use a stable hash in production so assignment survives process restarts)
    if hash(user_id) % 100 < 5:
        return model_b  # New model (canary)
    else:
        return model_a  # Current model

model = get_model(user_id)
answer = model.predict(question)

# Monitor metrics per model version
monitor_metrics(model_version=model.version, user_id=user_id)
Rollback:
python
# If the canary's error rate exceeds 5%, roll back
if canary_error_rate > 0.05:
    rollback_to_previous_version()
    alert_team("Canary deployment failed")
Interleaving (Mix Results)
Setup:
code
Show results from both models, interleaved
Track which results users click
Infer which model is better
Example (Search Ranking):
code
Model A results: [R1_A, R2_A, R3_A, R4_A]
Model B results: [R1_B, R2_B, R3_B, R4_B]
Interleaved:     [R1_A, R1_B, R2_A, R2_B, R3_A, R3_B]

User clicks: R1_B, R2_A → Model B: 1 click, Model A: 1 click
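A minimal sketch of this idea in Python, alternating results from both models and crediting clicks to whichever model contributed the clicked result; the result lists and clicks below are illustrative:
python
def interleave(results_a, results_b):
    """Alternate results from A and B, skip duplicates, and remember provenance."""
    interleaved, provenance, seen = [], {}, set()
    for item_a, item_b in zip(results_a, results_b):
        for item, source in ((item_a, "A"), (item_b, "B")):
            if item not in seen:
                seen.add(item)
                interleaved.append(item)
                provenance[item] = source
    return interleaved, provenance

def score_clicks(clicked_items, provenance):
    """Credit each click to the model that contributed the clicked result."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        credit[provenance[item]] += 1
    return credit

results_a = ["R1_A", "R2_A", "R3_A"]
results_b = ["R1_B", "R2_B", "R3_B"]
ranking, provenance = interleave(results_a, results_b)
print(ranking)                                     # ['R1_A', 'R1_B', 'R2_A', ...]
print(score_clicks(["R1_B", "R2_A"], provenance))  # {'A': 1, 'B': 1}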
Online Metrics
Engagement
- Click-through rate (CTR): % of users who click
- Time on page: How long users stay
- Pages per session: How many pages viewed
- Bounce rate: % who leave immediately
Satisfaction
- Thumbs up/down: Explicit feedback
- Star ratings: 1-5 scale
- NPS (Net Promoter Score): "Would you recommend?"
- CSAT (Customer Satisfaction): "How satisfied are you?"
Task Success
- Completion rate: % who complete the task
- Success rate: % who achieve the goal
- Retry rate: % who retry the query
- Abandonment rate: % who give up
Efficiency
- Time to complete: How long to finish the task
- Steps needed: How many interactions
- Query reformulations: How many retries
Safety
- Violation rate: % of harmful outputs
- Flag rate: % of flagged responses
- Escalation rate: % requiring human intervention
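A minimal sketch of aggregating a few of these metrics from logged events; the event schema (`event_type`, `rating`, and so on) is an assumption for illustration:
python
# Hypothetical event log; real systems would pull this from analytics storage
events = [
    {"user_id": "u1", "event_type": "impression"},
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "rating", "rating": 5},
    {"user_id": "u2", "event_type": "impression"},
    {"user_id": "u2", "event_type": "abandon"},
]

impressions = sum(e["event_type"] == "impression" for e in events)
clicks = sum(e["event_type"] == "click" for e in events)
abandons = sum(e["event_type"] == "abandon" for e in events)
ratings = [e["rating"] for e in events if e["event_type"] == "rating"]
users = {e["user_id"] for e in events}

ctr = clicks / impressions if impressions else 0.0             # engagement
abandonment_rate = abandons / len(users)                       # task success (inverse)
avg_rating = sum(ratings) / len(ratings) if ratings else None  # satisfaction

print(f"CTR={ctr:.0%}, abandonment={abandonment_rate:.0%}, avg rating={avg_rating}")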
Implicit Signals
Did User Click Result? (Relevance)
code
User clicks result → Likely relevant
User doesn't click → Likely not relevant
Did User Reformulate Query? (Dissatisfaction)
code
User reformulates        → Dissatisfied with the answer
User doesn't reformulate → Satisfied
Did User Abandon? (Failure)
code
User abandons  → Failed to help
User continues → Successful
Session Length (Engagement)
code
Long session  → Engaged
Short session → Not engaged (or very efficient)
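A minimal sketch of turning a session record into these implicit signals; the session fields below are assumptions for illustration:
python
# Hypothetical session record assembled from logs
session = {
    "queries": ["capital of france", "what is the capital city of france"],
    "clicked_results": 1,
    "completed_task": False,
    "duration_seconds": 45,
}

signals = {
    "clicked": session["clicked_results"] > 0,          # relevance proxy
    "reformulated": len(session["queries"]) > 1,        # dissatisfaction proxy
    "abandoned": not session["completed_task"],         # failure proxy
    "long_session": session["duration_seconds"] > 120,  # engagement proxy
}
print(signals)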
Explicit Feedback
Thumbs Up/Down
python
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "feedback": "thumbs_up",  # or "thumbs_down"
    "timestamp": "2024-01-15T10:00:00Z",
}

# Aggregate: thumbs_up / thumbs_down are counts over all logged feedback
thumbs_up_rate = thumbs_up / (thumbs_up + thumbs_down)
Star Ratings (1-5)
python
rating = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "rating": 5,  # 1-5 scale
    "timestamp": "2024-01-15T10:00:00Z",
}

# Aggregate over all logged ratings
avg_rating = sum(ratings) / len(ratings)
Written Feedback
python
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "feedback_text": "Perfect answer, very helpful!",
    "timestamp": "2024-01-15T10:00:00Z",
}

# Analyze sentiment of the free-text feedback
sentiment = analyze_sentiment(feedback["feedback_text"])
Bug Reports
python
bug_report = {
    "question": "What is the capital of France?",
    "answer": "Lyon",  # Incorrect
    "issue": "Incorrect answer",
    "timestamp": "2024-01-15T10:00:00Z",
}

# Track the issue and add the corrected example to the ground-truth set
add_to_ground_truth(question, correct_answer="Paris")
Bridging Offline and Online
Offline Metrics Should Correlate with Online
code
Hypothesis: Higher offline accuracy → Higher online satisfaction

Test:
- Model A: 85% offline accuracy → 75% thumbs up
- Model B: 90% offline accuracy → 80% thumbs up

Correlation: ✓ Offline predicts online
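A minimal sketch of checking that correlation across past model versions using scipy's Pearson correlation; the paired offline/online numbers below are illustrative:
python
from scipy.stats import pearsonr

# Offline accuracy and online thumbs-up rate for several past model versions (illustrative)
offline_accuracy = [0.80, 0.85, 0.88, 0.90]
online_thumbs_up = [0.70, 0.75, 0.77, 0.80]

r, p_value = pearsonr(offline_accuracy, online_thumbs_up)
print(f"Pearson r={r:.2f}, p={p_value:.3f}")
if r > 0.7:
    print("Offline accuracy looks like a reasonable proxy for online satisfaction")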
Validate Offline Improvements Lead to Online Gains
code
Process:
1. Improve offline metric (85% → 90% accuracy)
2. Deploy to production (A/B test)
3. Measure online metric (75% → 80% thumbs up)
4. If online improves → Offline metric is a good proxy
5. If online doesn't improve → Offline metric is misleading
Offline for Filtering, Online for Final Decision
code
Workflow:
1. Offline: Test 10 model variants
2. Filter: Keep top 3 (based on offline metrics)
3. Online: A/B test the top 3
4. Ship: Best performer online
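A minimal sketch of the filter step, assuming each variant has already been scored offline (variant names and scores below are illustrative):
python
# Offline scores for ten candidate variants (illustrative)
offline_scores = {
    "variant_01": 0.82, "variant_02": 0.88, "variant_03": 0.79, "variant_04": 0.91,
    "variant_05": 0.85, "variant_06": 0.80, "variant_07": 0.87, "variant_08": 0.83,
    "variant_09": 0.78, "variant_10": 0.86,
}

# Keep the top 3 by offline metric; only these go on to an online A/B test
top_3 = sorted(offline_scores, key=offline_scores.get, reverse=True)[:3]
print("Candidates for online A/B testing:", top_3)
# ['variant_04', 'variant_02', 'variant_07']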
When Offline and Online Disagree
Example
code
Offline: Model A is better (90% vs 85% accuracy)
Online:  Model B performs better (80% vs 75% thumbs up)
Possible Reasons
Dataset Not Representative:
code
Test set:   Simple questions
Production: Complex, ambiguous questions
→ Test set doesn't match reality
Metric Doesn't Capture What Matters:
code
Offline: Accuracy (correct answer)
Online:  Helpfulness (useful answer)
→ Correct ≠ helpful
User Behavior Differs from Test Set:
code
Test set:   Factual questions
Production: Conversational queries
→ Different use cases
Trust Online (But Investigate Why)
code
Online metrics = actual user impact → Trust online

But investigate:
- Why did offline mislead?
- Update the offline dataset/metrics
- Improve offline-online correlation
Continuous Evaluation
Log All Predictions + Outcomes
python
log_entry = {
    "timestamp": "2024-01-15T10:00:00Z",
    "question": "What is the capital of France?",
    "answer": "Paris",
    "model_version": "v1.2.0",
    "latency_ms": 250,
    "user_feedback": "thumbs_up",
    "user_id": "user123",
}
db.logs.insert(log_entry)
Offline Eval on Recent Data
python
# Weekly: evaluate on last week's production data
recent_data = db.logs.find({
    "timestamp": {"$gte": one_week_ago}
}).limit(1000)

# Evaluate
results = evaluate_model(recent_data)

# Compare to baseline
if results["accuracy"] < baseline_accuracy - 0.05:
    alert("Model performance degraded!")
Online Eval via A/B Tests
code
Monthly: A/B test new model variant
Measure: User satisfaction, task success
Ship:    If statistically significant improvement
Monitor Metrics Dashboard
code
Dashboard:
- Offline metrics (accuracy, F1)
- Online metrics (thumbs up rate, task success)
- Latency (P50, P95, P99)
- Error rate
- Traffic volume

Alerts:
- Accuracy drops >5%
- Thumbs up rate drops >10%
- Latency P95 >1s
- Error rate >1%
Guardrails for Online Eval
Automated Rollback (If Metrics Drop)
python
import time

def monitor_canary():
    while True:
        metrics = get_canary_metrics()
        if metrics["error_rate"] > 0.05:
            rollback()
            alert("High error rate, rolled back")
            return  # stop monitoring once rolled back
        if metrics["thumbs_down_rate"] > 0.3:
            rollback()
            alert("High dissatisfaction, rolled back")
            return
        time.sleep(60)  # Check every minute
Manual Review (Before Wide Rollout)
code
Process:
1. Canary to 1% (automated)
2. Monitor for 24 hours
3. Manual review (check logs, user feedback)
4. If good, increase to 10%
5. Repeat until 100%
Sampling (Don't Test on All Traffic)
code
Start small:
- 1% canary (low risk)
- Monitor closely
- Gradually increase
- Never test an unproven model on 100% of traffic
Reversibility (Easy to Revert)
code
Feature flags:
- Easy to turn off the new model
- Instant rollback
- No code deployment needed
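A minimal sketch of a feature-flag gate; the in-memory flag dict and stand-in models below are assumptions standing in for a real flag service (LaunchDarkly, Split, etc.):
python
# Stand-in models for illustration
model_a = lambda question: "answer from current model"
model_b = lambda question: "answer from new model"

# Hypothetical in-memory flag store; in practice this comes from a flag service
feature_flags = {"use_new_model": True}

def get_active_model():
    # Flipping the flag instantly reverts to the old model, with no code deployment
    return model_b if feature_flags["use_new_model"] else model_a

print(get_active_model()("What is the capital of France?"))  # new model
feature_flags["use_new_model"] = False                        # instant rollback
print(get_active_model()("What is the capital of France?"))  # current model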
Offline-to-Online Workflow
1. Develop: Offline Eval on Test Set
code
Iterate on the model
Evaluate on the test set (1000 examples)
Improve until offline metrics are good
2. Validate: Shadow Mode (Offline on Live Traffic)
code
Run in shadow mode (1 week)
Evaluate on live traffic (offline metrics)
Compare to the current model
3. Test: Canary to 1% (Online, Minimal Risk)
code
Deploy to 1% of users
Monitor online metrics (24 hours)
If good, proceed
4. Expand: Gradual Rollout to 100%
code
1% → 5% → 10% → 25% → 50% → 100%
Monitor at each step
Roll back if issues appear
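A minimal sketch of such a ramp with a health check at every step; `set_traffic_percentage`, `get_canary_metrics`, the soak time, and the thresholds are all hypothetical:
python
import time

ROLLOUT_STEPS = [1, 5, 10, 25, 50, 100]  # percentage of traffic on the new model

def gradual_rollout(set_traffic_percentage, get_canary_metrics, soak_seconds=24 * 3600):
    for percent in ROLLOUT_STEPS:
        set_traffic_percentage(percent)
        time.sleep(soak_seconds)  # let metrics accumulate at this step
        metrics = get_canary_metrics()
        if metrics["error_rate"] > 0.05 or metrics["thumbs_down_rate"] > 0.3:
            set_traffic_percentage(0)  # roll back to the current model
            return False
    return True  # reached 100% with healthy metrics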
5. Monitor: Continuous Online Evaluation
code
Track metrics in production
Detect regressions
A/B test improvements
Real-World Examples
Search Ranking
code
Offline: NDCG@10 (ranking quality)
Online:  Click-through rate (CTR)
Correlation: High NDCG → High CTR
Recommendation
code
Offline: Precision@k, Recall@k
Online:  Engagement (clicks, watch time)
Correlation: High precision → High engagement
RAG
code
Offline: Faithfulness, relevance
Online:  Thumbs up rate, task success
Correlation: High faithfulness → High thumbs up
Tools
Offline
- Custom scripts
- Evaluation frameworks (RAGAS, DeepEval)
- Jupyter notebooks
Online
- Experimentation platforms (Optimizely, LaunchDarkly)
- Analytics (Google Analytics, Mixpanel)
- APM (Datadog, New Relic)
Both
- MLOps platforms (MLflow, Weights & Biases)
- Feature flags (LaunchDarkly, Split)
- Monitoring (Prometheus, Grafana)
Summary
Offline: Test on a static dataset (fast, safe, reproducible)
Online:  Test in production (real performance, actual impact)
Need Both:
- Offline for rapid iteration
- Online for final validation
Offline Pros:
- Fast, safe, cheap, reproducible
Online Pros:
- Real performance, actual impact
Online Methods:
- A/B testing (compare variants)
- Shadow mode (no user impact)
- Canary (gradual rollout)
Online Metrics:
- Engagement, satisfaction, task success, efficiency, safety
Bridging:
- Offline should correlate with online
- Validate improvements
- Offline for filtering, online for decision
Workflow:
- Develop (offline)
- Validate (shadow mode)
- Test (canary 1%)
- Expand (gradual rollout)
- Monitor (continuous)
Guardrails:
- Automated rollback
- Manual review
- Sampling
- Reversibility