Ground Truth Labeling
Purpose
Establish rigorous, defensible ground truth labels for evaluation datasets. Ensures labels derive from authoritative sources rather than intuition, and documents reasoning for reproducibility and auditability.
When to Use
Invoke when:
- •Creating ground truth labels for a new dataset
- •Reviewing records with high scorer/judge disagreement
- •Refining existing labels based on new understanding
- •Auditing label consistency across a dataset
Core Principle: Guidelines Are Authoritative
Ground truth derives from explicit guidelines, not intuition.
When labeling, the answer must come from the established criteria themselves, not from general judgment about what "should" be the case.
❌ "This seems like good journalism, so it shouldn't be flagged" ✅ "Guideline X permits quoting harmful language when [condition]. This article meets that condition."
Workflow
1. Load Relevant Guidelines
Before labeling any record, load and review the authoritative criteria:
- •What rules apply?
- •What are the explicit conditions for violation/non-violation?
- •What edge cases does the guideline address?
2. Analyze the Record
For each record:
- •Identify potential issues (terminology, framing, sources, etc.)
- •For each issue, find the specific guideline provision that applies
- •Determine if the guideline's conditions for violation are met
3. Construct the Label
Label structure:
ground_truth: violating: true/false reasons: - Primary reason with guideline reference - 'OPTIONAL: Secondary observation that scorers need not require'
Reason categories:
- •Primary: Scorers should expect judges to identify this
- •OPTIONAL: Valid observation that reasonable judges might not mention
4. Document Ambiguity
High disagreement signals:
- •Ambiguity in the guidelines themselves
- •Cases where guidelines conflict or don't clearly apply
- •Need to consult authoritative sources
When encountering genuine ambiguity, document it - don't force a label.
OPTIONAL Reasons
Prefix with "OPTIONAL:" for secondary observations:
- •Scorers should not require judges to mention these
- •If a judge does comment, scorers should expect correctness
- •Captures edge cases or nuanced guideline applications
Example:
reasons: - Article provides critical framing and therefore DOES NOT VIOLATE quote attribution rules. - 'OPTIONAL: Uses "activists" - guidelines discourage this when implying negative connotations, but usage here is neutral.'
Common Labeling Pitfalls
| Pitfall | Correction |
|---|---|
| Labeling by intuition | Find explicit guideline provision |
| Assuming guidelines agree | Check each criterion separately |
| Over-strict interpretation | Guidelines often permit with conditions |
| Ignoring context | Most guidelines consider framing/purpose |
| Binary thinking | Use OPTIONAL for nuanced observations |
Consistency Checks
When refining labels:
- •Same reasoning → same label: If two records have the same characteristic, they should have the same label
- •Document changes: Log all label changes with rationale
- •Test edge cases: Does this label imply changes to similar records?
Output
For each labeling decision, provide:
- •The label (violating: true/false)
- •Primary reason(s) with guideline references
- •Any OPTIONAL observations
- •Rationale connecting guideline text to record content