Ground Truth Collector

Overview

Design effective data annotation workflows that produce high-quality labeled data for AI training. Define guidelines, quality control, and labeler training.

Core principle: Label quality sets the ceiling on model quality. Invest in annotation quality up front rather than debugging model quality later.

When to Use

• Starting a new labeling project
• Improving label quality
• Scaling an annotation team
• Defining a new label schema

Output Format

yaml

annotation_project:
  name: "[Project name]"
  date: "[YYYY-MM-DD]"
  
  scope:
    data_type: "[Text | Image | Audio | etc.]"
    task_type: "[Classification | NER | Bounding box | etc.]"
    volume: "[Total items to label]"
    timeline: "[Duration]"
  
  label_schema:
    labels:
      - label: "[Label name]"
        definition: "[Clear definition]"
        examples:
          positive: ["[Example that IS this label]"]
          negative: ["[Example that is NOT this label]"]
        edge_cases: ["[Guidance for ambiguous cases]"]
  
  guidelines:
    document: "[Link to full guidelines]"
    key_rules:
      - "[Rule 1]"
      - "[Rule 2]"
    
    decision_tree:
      - condition: "[If this]"
        then: "[Label as this]"
  
  quality_control:
    inter_annotator_agreement:
      metric: "[Cohen's Kappa | Fleiss' Kappa | etc.]"
      threshold: "[Acceptable level]"
    
    review_process:
      method: "[Gold standard | Expert review | Consensus]"
      sample_rate: "[% reviewed]"
    
    feedback_loop:
      - "[How labelers get feedback]"
  
  labeler_training:
    onboarding:
      - "[Training step 1]"
    qualification:
      test: "[How to test readiness]"
      passing: "[Threshold]"
  
  workflow:
    tool: "[Annotation tool]"
    batching: "[How work is assigned]"
    escalation: "[How to handle unclear cases]"
  
  metrics:
    throughput: "[Items per hour target]"
    quality: "[Accuracy target]"
    consistency: "[IAA target]"
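
Before a filled-in template goes to the annotation team, it can help to confirm that no placeholder was left behind. The sketch below is one possible check, assuming the YAML has already been loaded into nested Python dicts and lists by a parser of your choice. The function name `unfilled_fields` and the sample project values are hypothetical and only serve as illustration.

```python
import re

# Matches template placeholders such as "[Project name]" or "[YYYY-MM-DD]".
PLACEHOLDER = re.compile(r"^\[.*\]$")

def unfilled_fields(node, path=""):
    """Return the dotted path of every value that is still a "[...]" placeholder."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += unfilled_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += unfilled_fields(value, f"{path}[{index}]")
    elif isinstance(node, str) and PLACEHOLDER.match(node.strip()):
        found.append(path)
    return found

# Hypothetical, partially completed project (values invented for illustration).
project = {
    "annotation_project": {
        "name": "Support ticket intent",
        "date": "[YYYY-MM-DD]",
        "scope": {"data_type": "Text", "task_type": "Classification"},
        "guidelines": {"key_rules": ["One label per ticket", "[Rule 2]"]},
    }
}

missing = unfilled_fields(project)
for field in missing:
    print("still a placeholder:", field)
```

An empty result means every field has been replaced with real content. It says nothing about whether that content is good, so it complements a human review of the guidelines rather than replacing it.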

Annotation Guidelines Template

Structure

yaml

guideline_structure:
  overview:
    - "Task description"
    - "Why this matters"
    - "How labels will be used"
  
  label_definitions:
    - label: "[Label]"
      definition: "[1-2 sentence definition]"
      criteria: ["[What makes something this label]"]
  
  examples:
    - type: "Clear positive"
      example: "[Example]"
      label: "[Label]"
      explanation: "[Why]"
    
    - type: "Clear negative"
      example: "[Example]"
      not_label: "[Label]"
      explanation: "[Why not]"
    
    - type: "Edge case"
      example: "[Ambiguous example]"
      label: "[Correct label]"
      reasoning: "[How to decide]"
  
  common_mistakes:
    - mistake: "[What labelers often do wrong]"
      correction: "[What to do instead]"

Quality Control Methods

Inter-Annotator Agreement (IAA)

yaml

iaa_methods:
  two_annotators:
    metric: "Cohen's Kappa"
    formula: "(Observed agreement - Expected agreement) / (1 - Expected agreement)"
    interpretation:
      ">0.8": "Excellent agreement"
      "0.6-0.8": "Good agreement"
      "0.4-0.6": "Moderate agreement"
      "<0.4": "Poor agreement - review guidelines"
  
  multiple_annotators:
    metric: "Fleiss' Kappa"
    use_when: "3+ annotators per item"
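
To make the two metrics above concrete, here is a minimal standard-library sketch of both. It assumes every annotator labels every item, and it returns 1.0 in the degenerate case where chance agreement is already 1, which is a convention chosen here rather than part of either definition. The `interpret` helper applies the bands listed above, with each boundary value assigned to the higher band (also an assumption). All labels in the example are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items, in the same order."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("expected two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both used one identical label for every item
    return (observed - expected) / (1 - expected)

def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item, same rater count each."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings (at least 2)")
    totals, per_item = Counter(), 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs that agree on this item.
        per_item += (sum(c * c for c in counts.values()) - n_raters) / (n_raters * (n_raters - 1))
    observed = per_item / n_items
    expected = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)

def interpret(kappa):
    """Map a kappa value to the interpretation bands listed above."""
    if kappa > 0.8:
        return "Excellent agreement"
    if kappa >= 0.6:
        return "Good agreement"
    if kappa >= 0.4:
        return "Moderate agreement"
    return "Poor agreement - review guidelines"

# Two annotators, ten items, one disagreement (invented labels).
annotator_a = ["y"] * 6 + ["n"] * 4
annotator_b = ["y"] * 6 + ["n"] * 3 + ["y"]
kappa_two = cohens_kappa(annotator_a, annotator_b)

# Three annotators, four items (invented labels).
kappa_many = fleiss_kappa([["y", "y", "y"], ["y", "y", "n"], ["n", "n", "n"], ["n", "n", "y"]])

print(round(kappa_two, 3), interpret(kappa_two))
print(round(kappa_many, 3), interpret(kappa_many))
```

In the two-annotator example, raw agreement is 90% but kappa is lower because both annotators favor the same label, so part of that agreement is expected by chance. This is why the threshold in the project template should be set on kappa rather than on raw percent agreement.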

Review Strategies

Strategy	Description	When to Use
Gold standard	Compare to expert-labeled set	New labelers, quality checks
Expert review	Expert reviews sample	High-stakes data
Consensus	Multiple labelers, take majority	Subjective tasks
Adjudication	Expert resolves disagreements	When consensus fails
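
The Consensus and Adjudication rows combine naturally: take the majority label when there is one, and route the item to an expert when there is not. The following is a minimal sketch of that hand-off. The function name `resolve`, the strict-majority rule, and the sample votes are assumptions made for illustration; a project may prefer a higher bar, such as two-thirds agreement.

```python
from collections import Counter

def resolve(votes, min_share=0.5):
    """Return (label, "consensus") when one label exceeds min_share of the votes,
    otherwise (None, "adjudicate") so an expert can decide."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, top = Counter(votes).most_common(1)[0]
    if top / len(votes) > min_share:
        return label, "consensus"
    return None, "adjudicate"

# Invented votes from three labelers per item.
items = {
    "ticket-1": ["billing", "billing", "refund"],
    "ticket-2": ["billing", "refund", "shipping"],
}
decisions = {item_id: resolve(votes) for item_id, votes in items.items()}
for item_id, (label, route) in decisions.items():
    print(item_id, label, route)
```

Items routed to "adjudicate" are also useful input for the guidelines: repeated disagreement on the same kind of item usually points to a missing edge case.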

Labeler Training

Onboarding Process

yaml

training_process:
  1_guidelines_review:
    duration: "30-60 min"
    content: "Read and understand guidelines"
    quiz: "Check comprehension"
  
  2_guided_practice:
    duration: "1-2 hours"
    content: "Label examples with feedback"
    sample_size: "20-50 items"
  
  3_qualification_test:
    duration: "30-60 min"
    content: "Independent labeling, no feedback during the test"
    sample_size: "30-50 items"
    passing: ">90% agreement with gold standard"
  
  4_gradual_independence:
    content: "Start live work with higher review rate"
    review_rate: "100% → 20% as quality proves out"
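
Steps 3 and 4 can be sketched in a few lines. The example assumes the qualification set is stored as a mapping from item id to gold label, and it reads ">90%" as strictly greater than 90%. The taper schedule (20 percentage points per good batch, reset to 100% after a bad one) is a hypothetical policy; the process above only fixes the start and end points. All function names and data are invented for illustration.

```python
def agreement_with_gold(answers, gold):
    """Count gold items the candidate labeled identically. Both map item id -> label."""
    matches = sum(1 for item_id, label in gold.items() if answers.get(item_id) == label)
    return matches, len(gold)

def is_qualified(matches, total, threshold_pct=90):
    """Pass only when agreement is strictly above the threshold (integer math avoids rounding)."""
    return total > 0 and matches * 100 > threshold_pct * total

def next_review_rate(current_pct, batch_ok, step_pct=20, floor_pct=20):
    """Assumed policy: lower the review rate after a good batch, reset to 100% after a bad one."""
    if not batch_ok:
        return 100
    return max(floor_pct, current_pct - step_pct)

# Invented 30-item qualification test with two wrong answers.
gold = {f"item-{i}": "pos" if i % 2 else "neg" for i in range(30)}
answers = dict(gold)
answers["item-0"] = "pos"
answers["item-1"] = "neg"
matches, total = agreement_with_gold(answers, gold)
qualified = is_qualified(matches, total)
print(f"{matches}/{total} correct, qualified: {qualified}")

# Review rate over eight live batches, the third of which fails review.
rate, schedule = 100, []
for batch_ok in [True, True, False, True, True, True, True, True]:
    rate = next_review_rate(rate, batch_ok)
    schedule.append(rate)
print(schedule)
```

With a 30-item test, a strict 90% threshold allows at most two mistakes, which is worth stating explicitly to candidates before they start.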

Ongoing Quality

yaml

ongoing_quality:
  regular_checks:
    - "Insert gold standard items periodically"
    - "Track individual labeler performance"
    - "Provide feedback on errors"
  
  calibration:
    frequency: "Weekly or after guideline changes"
    method: "All labelers label same set, discuss disagreements"
  
  retraining:
    trigger: "Quality drops below threshold"
    process: "Focused training on error patterns"
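
The regular checks and the retraining trigger above can be wired together as follows. This sketch assumes gold standard items are mixed into normal batches and identified by item id, and that submissions arrive as (labeler, item id, label) tuples. The 0.85 threshold, the function names, and the data are hypothetical; use the threshold defined in your own quality control plan.

```python
from collections import Counter, defaultdict

def score_gold_items(submissions, gold):
    """Score each labeler on the gold items only; other items are ignored.
    Returns per-labeler accuracy and a count of (labeler, gold label, given label) errors."""
    hits, seen, errors = defaultdict(int), defaultdict(int), Counter()
    for labeler, item_id, label in submissions:
        if item_id not in gold:
            continue
        seen[labeler] += 1
        if label == gold[item_id]:
            hits[labeler] += 1
        else:
            errors[(labeler, gold[item_id], label)] += 1
    accuracy = {labeler: hits[labeler] / seen[labeler] for labeler in seen}
    return accuracy, errors

def needs_retraining(accuracy, threshold=0.85):
    """List labelers whose gold accuracy fell below the (assumed) threshold."""
    return sorted(labeler for labeler, acc in accuracy.items() if acc < threshold)

# Invented gold items and submissions.
gold = {"g1": "pos", "g2": "neg"}
submissions = [
    ("ana", "g1", "pos"), ("ana", "g2", "neg"), ("ana", "x9", "pos"),
    ("ben", "g1", "pos"), ("ben", "g2", "pos"),
]
accuracy, errors = score_gold_items(submissions, gold)
flagged = needs_retraining(accuracy)
print(accuracy, flagged)
print(errors.most_common(3))
```

The error counter is what makes retraining focused: it shows which gold label a labeler tends to confuse with which other label, so the session can target those cases instead of repeating the full guidelines.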

Workflow Design

Tool Selection Criteria

yaml

tool_criteria:
  must_have:
    - "Supports data type and task type"
    - "Quality control features"
    - "Export in needed format"
  
  nice_to_have:
    - "Active learning suggestions"
    - "Pre-labeling automation"
    - "Labeler performance analytics"

Batch Assignment

yaml

batching_strategy:
  size: "50-100 items per batch"
  assignment: "Random or stratified"
  overlap: "10-20% double-labeled for IAA"
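
As an illustration of the strategy above, the sketch below shuffles items into fixed-size batches and marks a random share for double labeling. It covers only the random assignment option. Stratified assignment would need a per-item stratum, which is not shown. The function name `make_batches`, the fixed seed, and the item ids are assumptions for the example.

```python
import random

def make_batches(item_ids, batch_size=50, overlap=0.10, seed=0):
    """Shuffle items into batches and choose a subset to be labeled twice for IAA."""
    if batch_size < 1 or not 0 <= overlap <= 1:
        raise ValueError("batch_size must be >= 1 and overlap between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    items = list(item_ids)
    rng.shuffle(items)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    double_labeled = set(rng.sample(items, round(len(items) * overlap)))
    return batches, double_labeled

batches, double_labeled = make_batches(range(500), batch_size=50, overlap=0.10)
print(len(batches), "batches,", len(double_labeled), "items double-labeled")
```

Each item in `double_labeled` should go to a second labeler who has not seen the first label, otherwise the agreement score measures copying rather than consistency.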

Checklist

• Label schema defined
• Guidelines documented with examples
• Quality control process defined
• Labeler training designed
• Tool selected and configured
• Workflow documented
• Metrics and thresholds set