Ground Truth Management
What is Ground Truth?
Definition: Correct answers for evaluation - human-verified data that serves as the gold standard for measuring AI performance.
Example
code
Question: "What is the capital of France?" Ground Truth: "Paris" AI Answer: "Paris" → Correct ✓ AI Answer: "Lyon" → Incorrect ✗
Why Ground Truth Matters
Measure Accuracy Objectively
code
Without ground truth: "This answer seems good" (subjective)
With ground truth: "Accuracy: 85%" (objective)
Train and Validate Models
code
Training: Learn from ground truth examples
Validation: Measure performance on a ground truth test set
Regression Testing
code
Before change: Accuracy 90%
After change: Accuracy 85%
→ Regression detected!
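A minimal sketch of an automated regression check, assuming you can score both model versions on the same ground truth set (the function name and 2% tolerance are illustrative, not a standard API):
python
def check_regression(baseline_accuracy, new_accuracy, tolerance=0.02):
    """Flag a regression if accuracy drops by more than `tolerance`."""
    drop = baseline_accuracy - new_accuracy
    if drop > tolerance:
        raise AssertionError(
            f"Regression detected: accuracy fell from {baseline_accuracy:.1%} "
            f"to {new_accuracy:.1%} (drop of {drop:.1%})"
        )
    return True

# Example: 90% before the change, 85% after -> raises AssertionError
check_regression(baseline_accuracy=0.90, new_accuracy=0.85)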
Benchmarking
code
Model A: 90% accuracy on ground truth
Model B: 85% accuracy on ground truth
→ Model A is better
Types of Ground Truth
Exact Match: Single Correct Answer
json
{
"question": "What is 2+2?",
"answer": "4"
}
Multiple Acceptable Answers
json
{
"question": "What is the capital of France?",
"acceptable_answers": ["Paris", "paris", "PARIS", "The capital is Paris"]
}
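A small sketch of how an evaluator might check a model answer against the acceptable variants, assuming simple normalization (lowercasing plus stripping whitespace and punctuation); the record shape follows the JSON above:
python
import string

def is_acceptable(answer, acceptable_answers):
    """Return True if `answer` matches any acceptable variant after normalization."""
    def normalize(text):
        return text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return normalize(answer) in {normalize(a) for a in acceptable_answers}

acceptable = ["Paris", "paris", "PARIS", "The capital is Paris"]
print(is_acceptable("paris.", acceptable))  # True
print(is_acceptable("Lyon", acceptable))    # False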
Rubric-Based: Quality Scale
json
{
"question": "Summarize this article",
"rubric": {
"1": "Poor summary, missing key points",
"3": "Adequate summary, covers main points",
"5": "Excellent summary, concise and comprehensive"
}
}
Human Preference: Comparison Rankings
json
{
"question": "Which answer is better?",
"answer_a": "Paris is the capital of France.",
"answer_b": "The capital of France is Paris, a city of 2.1 million people.",
"preference": "B",
"reasoning": "More informative"
}
Creating Ground Truth
Manual Annotation (Humans Label)
code
Process:
1. Collect examples (questions, documents, images)
2. Human annotators label each
3. Quality control (review annotations)
4. Store in dataset
Expert Review (For Specialized Domains)
code
Medical: Doctors annotate
Legal: Lawyers annotate
Technical: Engineers annotate
Higher quality but more expensive
Crowdsourcing (Amazon MTurk)
code
Pros:
- Fast (many workers)
- Cheap ($0.10-1.00 per annotation)
Cons:
- Variable quality
- Need quality control
Synthetic Generation (For Some Tasks)
code
LLM-generated questions + answers
Careful validation needed
Good for scale, risky for quality
Use for augmentation, not sole source
Ground Truth Dataset Structure
Input (Question, Document, Image)
json
{
"input": {
"type": "question",
"text": "What is the capital of France?"
}
}
Expected Output (Answer, Label, Summary)
json
{
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS"]
}
}
Metadata (Difficulty, Category, Source)
json
{
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
}
}
Annotation Info (Who, When, Confidence)
json
{
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95,
"time_spent_seconds": 30
}
}
Complete Example:
json
{
"id": "example_001",
"input": {
"type": "question",
"text": "What is the capital of France?"
},
"expected_output": {
"type": "answer",
"text": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
},
"metadata": {
"difficulty": "easy",
"category": "geography",
"source": "wikipedia",
"language": "en"
},
"annotation": {
"annotator_id": "annotator_123",
"timestamp": "2024-01-15T10:00:00Z",
"confidence": 0.95
}
}
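A quick validation sketch that checks each record carries the fields shown in the complete example before it enters the dataset (field names follow the example above; adjust to your own schema):
python
REQUIRED_TOP_LEVEL = {"id", "input", "expected_output", "metadata", "annotation"}

def validate_record(record):
    """Return a list of problems found in a single ground-truth record."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - record.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "expected_output" in record and not record["expected_output"].get("text"):
        problems.append("expected_output.text is empty")
    confidence = record.get("annotation", {}).get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")
    return problems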
Annotation Guidelines
Clear Instructions
markdown
# Annotation Guidelines

## Task
Label whether the answer is correct.

## Instructions
1. Read the question carefully
2. Read the answer
3. Determine if the answer is factually correct
4. Mark as "Correct" or "Incorrect"

## Examples
Question: "What is 2+2?"
Answer: "4"
Label: Correct

Question: "What is 2+2?"
Answer: "5"
Label: Incorrect
Examples (Good and Bad)
markdown
## Good Example
Question: "What is the capital of France?"
Answer: "Paris"
Label: Correct
Reasoning: Factually accurate and directly answers the question

## Bad Example
Question: "What is the capital of France?"
Answer: "France is a country in Europe"
Label: Incorrect
Reasoning: Doesn't answer the question
Edge Case Handling
markdown
## Edge Cases

### Partially Correct
Question: "What are the capitals of France and Germany?"
Answer: "Paris"
Label: Partially Correct (missing Germany)

### Ambiguous Question
Question: "What is the best programming language?"
Label: N/A - Subjective question, no single correct answer

### No Answer in Context
Question: "What is the population of Paris?"
Context: "Paris is the capital of France."
Label: "Cannot be determined from context"
Consistency Checks
markdown
## Consistency Rules
1. Same question → Same answer
2. Synonyms are acceptable ("car" = "automobile")
3. Case-insensitive ("Paris" = "paris")
4. Extra details are OK ("Paris" vs "Paris, France")
Quality Control
Multiple Annotators Per Example
code
Each example labeled by 3 annotators
Majority vote determines final label
Catches individual annotator errors
Inter-Annotator Agreement (IAA)
code
Measure: Do annotators agree?
Metric: Cohen's Kappa (κ)
Target: κ > 0.7 (good agreement)
Gold Standard Subset (Known Answers)
code
10% of examples have known correct labels
Mix into annotation tasks
Measure annotator accuracy on gold standard
Remove low-quality annotators
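A sketch of how gold-standard items can screen annotators, assuming each annotator's labels on the gold subset are available (the annotator names and 90% cutoff are illustrative):
python
def annotator_accuracy(labels, gold_labels):
    """Fraction of gold-standard items this annotator labeled correctly."""
    correct = sum(label == gold for label, gold in zip(labels, gold_labels))
    return correct / len(gold_labels)

gold = ["correct", "incorrect", "correct", "correct"]
annotators = {
    "annotator_123": ["correct", "incorrect", "correct", "correct"],     # 100%
    "annotator_456": ["incorrect", "incorrect", "correct", "incorrect"], # 50%
}

MIN_ACCURACY = 0.9
qualified = {name for name, labels in annotators.items()
             if annotator_accuracy(labels, gold) >= MIN_ACCURACY}
print(qualified)  # {'annotator_123'}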
Spot Checks by Experts
code
Expert reviews 10% of annotations
Validates quality
Identifies systematic errors
Inter-Annotator Agreement
Kappa Score (Cohen's κ)
python
from sklearn.metrics import cohen_kappa_score
annotator1 = [1, 0, 1, 1, 0] # Labels from annotator 1
annotator2 = [1, 0, 1, 0, 0] # Labels from annotator 2
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Kappa: {kappa:.2f}")
# Interpretation:
# κ < 0.4: Poor agreement
# κ 0.4-0.6: Moderate agreement
# κ 0.6-0.8: Good agreement
# κ > 0.8: Excellent agreement
Fleiss' κ (Multiple Annotators)
python
from statsmodels.stats.inter_rater import fleiss_kappa
# 3 annotators, 5 examples
# Each row: [count_label_0, count_label_1]
data = [
[0, 3], # Example 1: All 3 annotators chose label 1
[1, 2], # Example 2: 1 chose 0, 2 chose 1
[3, 0], # Example 3: All 3 chose label 0
[2, 1], # Example 4: 2 chose 0, 1 chose 1
[0, 3], # Example 5: All 3 chose label 1
]
kappa = fleiss_kappa(data)
print(f"Fleiss' Kappa: {kappa:.2f}")
Percentage Agreement
python
def percentage_agreement(annotator1, annotator2):
    agreements = sum(a == b for a, b in zip(annotator1, annotator2))
    total = len(annotator1)
    return agreements / total

agreement = percentage_agreement(annotator1, annotator2)
print(f"Agreement: {agreement:.1%}")
Target: >0.7 (Good Agreement)
code
If κ < 0.7:
1. Review annotation guidelines (unclear?)
2. Provide more examples
3. Train annotators
4. Simplify task
Resolving Disagreements
Majority Vote
python
from collections import Counter

def majority_vote(labels):
    counts = Counter(labels)
    majority_label = counts.most_common(1)[0][0]
    return majority_label

# 3 annotators
labels = [1, 1, 0]  # Two say 1, one says 0
final_label = majority_vote(labels)  # 1
Expert Adjudication
code
If no majority (e.g., 1, 0, 2):
→ Expert reviews and decides
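A possible extension of `majority_vote` that detects ties and routes them to expert adjudication instead of silently picking a label (a sketch, not a standard utility):
python
from collections import Counter

def resolve_label(labels):
    """Return the majority label, or None when the vote is tied and needs adjudication."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: send to an expert
    return counts[0][0]

print(resolve_label([1, 1, 0]))  # 1
print(resolve_label([1, 0, 2]))  # None -> expert adjudication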
Discussion and Consensus
code
Annotators discuss the disagreement
Reach consensus
Update guidelines if needed
Update Guidelines
code
If there are systematic disagreements:
→ Guidelines are unclear
→ Update and re-annotate
Ground Truth for Different Tasks
Classification: Category Labels
json
{
"text": "This product is amazing!",
"label": "positive"
}
Q&A: Correct Answers + Acceptable Variants
json
{
"question": "What is the capital of France?",
"answer": "Paris",
"acceptable_variants": ["paris", "PARIS", "The capital is Paris"]
}
Summarization: Reference Summaries
json
{
"document": "Long article text...",
"reference_summary": "Concise summary of key points"
}
RAG: Question + Context + Answer
json
{
"question": "What is the capital of France?",
"context": "Paris is the capital and largest city of France.",
"answer": "Paris",
"relevant_chunks": ["Paris is the capital and largest city of France."]
}
Generation: Multiple Acceptable Outputs
json
{
"prompt": "Write a haiku about spring",
"acceptable_outputs": [
"Cherry blossoms bloom\nGentle breeze carries petals\nSpring has arrived now",
"Flowers start to bloom\nBirds sing in the morning light\nSpring is here at last"
]
}
Dataset Size
Evaluation Set: 100-1000 Examples (Representative)
code
Purpose: Quick evaluation during development
Size: 100-1000 examples
Quality: High (manually curated)
Coverage: Representative of production
Test Set: 500-5000 Examples (Comprehensive)
code
Purpose: Final evaluation before deployment
Size: 500-5000 examples
Quality: High (gold standard)
Coverage: Comprehensive (all categories, edge cases)
Quality > Quantity
code
Better: 100 high-quality examples
Worse: 1000 low-quality examples
Cover Edge Cases
code
Include:
- Common cases (80%)
- Edge cases (15%)
- Adversarial cases (5%)
Dataset Maintenance
Version Control (Like Code)
bash
# Git for dataset versioning
git init
git add dataset.jsonl
git commit -m "Initial dataset v1.0"

# Tag versions
git tag v1.0

# Update dataset
git add dataset.jsonl
git commit -m "Added 100 new examples"
git tag v1.1
Regular Updates (New Examples)
code
Monthly: Add 50-100 new examples
Quarterly: Major update (500+ examples)
Remove Outdated Examples
code
Remove examples that are:
- No longer relevant
- Incorrect (facts changed)
- Duplicates
Track Changes (Changelog)
markdown
# Dataset Changelog

## v1.2 (2024-02-01)
- Added 100 new examples (geography category)
- Removed 20 outdated examples
- Fixed 5 incorrect labels

## v1.1 (2024-01-01)
- Added 50 new examples (science category)
- Updated annotation guidelines

## v1.0 (2023-12-01)
- Initial release (500 examples)
Stratified Sampling
Balance by Difficulty
code
Easy: 40%
Medium: 40%
Hard: 20%
Balance by Category
code
Geography: 25%
Science: 25%
History: 25%
Math: 25%
Include Edge Cases
code
Common cases: 80%
Edge cases: 15%
Adversarial: 5%
Representative of Production
code
Sample from actual production queries
Ensures the dataset matches real usage
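A minimal stratified-sampling sketch, assuming each example carries the `difficulty` metadata shown earlier; the target proportions mirror the split above, and `all_examples` is your full pool:
python
import random

def stratified_sample(examples, proportions, n_total, key="difficulty", seed=42):
    """Sample examples so each stratum matches its target proportion."""
    rng = random.Random(seed)
    sample = []
    for stratum, fraction in proportions.items():
        pool = [ex for ex in examples if ex["metadata"][key] == stratum]
        k = min(len(pool), round(n_total * fraction))
        sample.extend(rng.sample(pool, k))
    return sample

# e.g., 40% easy, 40% medium, 20% hard
targets = {"easy": 0.4, "medium": 0.4, "hard": 0.2}
# eval_set = stratified_sample(all_examples, targets, n_total=500)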
Synthetic Ground Truth
LLM-Generated Questions + Answers
python
def generate_synthetic_qa(document):
    # NOTE: `llm.generate` and `parse_qa_pairs` are placeholders for your
    # own LLM client and output parser.
    prompt = f"""
Document: {document}

Generate 5 question-answer pairs based on this document.
Format:
Q1: [question]
A1: [answer]
...
"""
    response = llm.generate(prompt)
    qa_pairs = parse_qa_pairs(response)
    return qa_pairs
Careful Validation Needed
code
LLM-generated data can have:
- Hallucinations
- Incorrect facts
- Biased questions
→ Always validate with humans
Good for Scale, Risky for Quality
code
Pros: Can generate 1000s quickly
Cons: Quality varies, needs validation
Use for Augmentation, Not Sole Source
code
Strategy:
- 80% human-annotated (high quality)
- 20% synthetic (validated)
Domain-Specific Ground Truth
Medical: Expert Annotations
code
Annotators: Licensed doctors
Cost: $50-100 per hour
Quality: Very high
Use case: Medical diagnosis, treatment recommendations
Legal: Lawyer Review
code
Annotators: Licensed lawyers
Cost: $100-300 per hour
Quality: Very high
Use case: Legal document analysis, case law
Technical: Engineer Verification
code
Annotators: Senior engineers
Cost: $50-150 per hour
Quality: High
Use case: Code review, technical Q&A
Ground Truth Storage
JSON/JSONL Files
jsonl
{"id": "1", "question": "What is 2+2?", "answer": "4"}
{"id": "2", "question": "Capital of France?", "answer": "Paris"}
Database (PostgreSQL, MongoDB)
sql
CREATE TABLE ground_truth (
    id UUID PRIMARY KEY,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    category VARCHAR(50),
    difficulty VARCHAR(20),
    created_at TIMESTAMP DEFAULT NOW()
);
Version Control (Git)
bash
git add dataset/
git commit -m "Update ground truth dataset"
git push
Cloud Storage (S3 + Versioning)
bash
# Upload to S3 with versioning
aws s3 cp dataset.jsonl s3://my-bucket/ground-truth/v1.0/dataset.jsonl
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
Ground Truth for RAG
Structure:
json
{
"question": "What is the capital of France?",
"expected_answer": "Paris",
"relevant_document_chunks": [
"Paris is the capital and largest city of France."
],
"evaluation_criteria": {
"faithfulness": "Answer must be grounded in context",
"relevance": "Answer must directly address question",
"completeness": "Answer should mention Paris"
}
}
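With relevant chunks stored in the ground truth, retrieval quality can be scored directly. Here is a sketch of retrieval recall under a simple exact-chunk-match assumption (real pipelines often use fuzzy or embedding-based matching instead):
python
def retrieval_recall(retrieved_chunks, relevant_chunks):
    """Fraction of ground-truth relevant chunks that were actually retrieved."""
    if not relevant_chunks:
        return 0.0
    hits = sum(chunk in retrieved_chunks for chunk in relevant_chunks)
    return hits / len(relevant_chunks)

relevant = ["Paris is the capital and largest city of France."]
retrieved = [
    "Paris is the capital and largest city of France.",
    "France is a country in Western Europe.",
]
print(retrieval_recall(retrieved, relevant))  # 1.0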
Evaluation with Ground Truth
Exact Match Accuracy
python
def exact_match(predicted, ground_truth):
    return predicted.strip().lower() == ground_truth.strip().lower()

# predictions and ground_truths are parallel lists of strings
accuracy = sum(exact_match(p, gt) for p, gt in zip(predictions, ground_truths)) / len(predictions)
F1 Score (For Overlapping Spans)
python
def f1_score(predicted, ground_truth):
    pred_tokens = set(predicted.lower().split())
    gt_tokens = set(ground_truth.lower().split())
    common = pred_tokens & gt_tokens
    if len(pred_tokens) == 0 or len(gt_tokens) == 0:
        return 0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gt_tokens)
    if precision + recall == 0:
        return 0
    f1 = 2 * (precision * recall) / (precision + recall)
    return f1
BLEU/ROUGE (For Generation)
python
from nltk.translate.bleu_score import sentence_bleu

reference = [["Paris", "is", "the", "capital"]]
candidate = ["Paris", "is", "the", "capital"]
bleu = sentence_bleu(reference, candidate)
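The heading mentions ROUGE as well; a comparable sketch using the `rouge_score` package (installed with `pip install rouge-score`) might look like this:
python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "Paris is the capital of France",  # reference
    "The capital of France is Paris",  # candidate
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)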
Semantic Similarity (Embedding Distance)
python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2')
emb1 = model.encode("Paris is the capital of France")
emb2 = model.encode("The capital of France is Paris")
similarity = cosine_similarity([emb1], [emb2])[0][0]
Continuous Ground Truth
Production Feedback (User Thumbs Up/Down)
python
# Log user feedback
feedback = {
    "question": "What is the capital of France?",
    "answer": "Paris",
    "user_feedback": "thumbs_up",
    "timestamp": "2024-01-15T10:00:00Z"
}

# Add to ground truth if positive
# (`add_to_ground_truth` is a helper you define; see the sketch below)
if feedback["user_feedback"] == "thumbs_up":
    add_to_ground_truth(feedback["question"], feedback["answer"])
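One possible implementation of the `add_to_ground_truth` helper used above, appending the pair to the JSONL dataset (the file path and fields are illustrative; new entries should still go through human review before counting as gold standard):
python
import json
from datetime import datetime, timezone

def add_to_ground_truth(question, answer, path="dataset.jsonl"):
    """Append a new candidate ground-truth record to the JSONL dataset."""
    record = {
        "input": {"type": "question", "text": question},
        "expected_output": {"type": "answer", "text": answer},
        "metadata": {"source": "production_feedback"},
        "annotation": {
            "annotator_id": "user_feedback",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")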
Human Review of Flagged Outputs
code
User flags an answer as incorrect
→ Human reviews
→ If incorrect, add the correct answer to ground truth
→ If correct, keep as is
Incrementally Add to Dataset
code
Monthly: Review 100 flagged examples
Add 50 to ground truth
Update dataset version
Tools
Annotation: Label Studio, Prodigy, CVAT
Label Studio:
bash
pip install label-studio
label-studio start
# Open http://localhost:8080
Prodigy:
bash
pip install prodigy
prodigy textcat.manual dataset_name source.jsonl --label positive,negative
Management: DVC (Data Version Control)
bash
pip install dvc
dvc init
dvc add dataset.jsonl
git add dataset.jsonl.dvc .gitignore
git commit -m "Add dataset"
dvc push
Storage: S3, GCS, Local Files
See "Ground Truth Storage" section
Summary
Ground Truth: Correct answers for evaluation
Why:
- Measure accuracy objectively
- Train/validate models
- Regression testing
- Benchmarking
Types:
- Exact match
- Multiple acceptable answers
- Rubric-based
- Human preference
Creating:
- Manual annotation
- Expert review
- Crowdsourcing
- Synthetic (with validation)
Quality Control:
- Multiple annotators
- Inter-annotator agreement (κ > 0.7)
- Gold standard subset
- Expert spot checks
Dataset Size:
- Eval: 100-1000 (representative)
- Test: 500-5000 (comprehensive)
- Quality > quantity
Maintenance:
- Version control (Git)
- Regular updates
- Remove outdated
- Changelog
Tools:
- Annotation: Label Studio, Prodigy
- Management: DVC
- Storage: S3, GCS, Git