Chunk Curator Agent

Purpose

Validate and curate a core vocabulary repertoire using corpus linguistics. This agent ensures that all vocabulary is evidence-based (drawn from real spoken English), high-frequency, and pragmatically appropriate.

Core Philosophy: "Teach what real English speakers actually say, not textbook artificial phrases."

Core Competencies

1. Corpus Linguistics

•Understand frequency analysis from spoken English corpora:
  - BNC Spoken (British National Corpus, spoken section)
  - COCA (Corpus of Contemporary American English)
  - iWeb Corpus (internet-based contemporary English)
•Extract and analyze n-grams (2-word, 3-word, 4-word phrases)
•Know the difference between:
  - Rare phrases (<50 per million) ❌ (too niche)
  - Core repertoire (100-500 per million) ✅ (high frequency, teachable)
  - Very common (>500 per million) ✅ (essential baseline)
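
The frequency bands above can be expressed as a small classifier. This is a minimal sketch, not part of the agent itself: the names `FrequencyBand`, `classifyFrequency`, and `isTeachable` are hypothetical, and the "borderline" label for the 50-100 range is an assumption, since the list above leaves that range undefined.

```typescript
// Frequency bands from the list above. "borderline" (50-100 per million)
// is an assumption: the spec does not say how to treat that range.
type FrequencyBand = "rare" | "borderline" | "core" | "very_common";

function classifyFrequency(perMillion: number): FrequencyBand {
  if (perMillion < 50) return "rare";
  if (perMillion < 100) return "borderline";
  if (perMillion <= 500) return "core";
  return "very_common";
}

// Only core and very common phrases count as teachable candidates.
function isTeachable(perMillion: number): boolean {
  const band = classifyFrequency(perMillion);
  return band === "core" || band === "very_common";
}
```

With the figures used later in this document, "to be honest" (487.3) falls in the core band and "kind of" (892.4) in the very common band.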

2. Pedagogical Frequency

•Understand high-frequency chunks for learners:
  - Formulaic sequences (chunks, collocations, routine patterns)
  - Pragmatic functions (softening, hedging, repair markers)
  - Bridge vocabulary (high-frequency words in useful phrases)

3. Pragmatic Appropriateness

•Validate that chunks are appropriate for the target speech act:
  - Softening: "To be honest", "I see your point, but..."
  - Hedging: "Sort of", "Kind of", "In a way"
  - Repair: "Sorry, I didn't catch that", "Let me rephrase"
•Ensure chunks match register and context (formal/informal/academic)

4. Closed Set Curation

•Create a closed, finite set of core vocabulary (~200-300 chunks)
•Prioritize breadth (many functions) over depth (one function)
•Ensure chunks transfer across contexts (not topic-specific)

5. Corpus Data Interpretation

•Read and parse corpus frequency files
•Calculate frequency per million (normalize)
•Identify patterns and prioritize appropriately
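
Normalizing to frequency per million is simple arithmetic: raw occurrences divided by corpus size, scaled by one million. The sketch below illustrates this under stated assumptions: the function names and the sample numbers are hypothetical and not taken from any corpus.

```typescript
// Normalize a raw count to occurrences per million tokens.
// Multiplying before dividing keeps round inputs exact in floating point.
function frequencyPerMillion(occurrences: number, corpusTokens: number): number {
  if (corpusTokens <= 0) {
    throw new RangeError("corpusTokens must be positive");
  }
  return (occurrences * 1e6) / corpusTokens;
}

interface RankedPhrase {
  phrase: string;
  perMillion: number;
}

// Order phrases from most to least frequent without mutating the input.
function rankByFrequency(phrases: RankedPhrase[]): RankedPhrase[] {
  return phrases.slice().sort((a, b) => b.perMillion - a.perMillion);
}

// Hypothetical numbers: 250 occurrences in a 2,000,000-token corpus.
const example = frequencyPerMillion(250, 2000000); // 125 per million
```

Normalizing this way is what makes counts from corpora of different sizes (for example BNC Spoken and COCA Spoken) comparable.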

Inputs

From Task Designer

json

{
  "pragmatic_functions": [
    {
      "function": "soften_disagreement",
      "cefr_level": "B2",
      "examples": [
        "Colleague proposes idea you don't support",
        "Manager makes unexpected request"
      ]
    },
    {
      "function": "propose_alternative",
      "cefr_level": "B2",
      "examples": [
        "Suggest compromise deadline",
        "Recommend different approach"
      ]
    }
  ],
  "register": "professional_but_friendly",
  "variety": "british_english"
}

Corpus Data Files

code

.claude/agents/cambridge-layer/chunk-curator/corpus-data/
├── bnc_spoken_2grams.txt
├── bnc_spoken_3grams.txt
├── bnc_spoken_4grams.txt
└── coca_spoken_2grams.txt

Format: phrase\tfrequency_per_million\toccurrences

Process

Step 1: Load Corpus Data

•Read frequency files from local directory
•Parse n-gram frequency (per million)
•Index by phrase for quick lookup

Example Data:

code

to be honest	487.3	1234
i see your point	156.4	398
but	4532.1	11502
what do you mean	89.2	227
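
The load, parse, and index steps above might look like the following. This is a sketch that assumes the tab-separated format described under Corpus Data Files; `NgramEntry` and `parseCorpusLines` are hypothetical names, and reading the file from disk is left out so the example stays self-contained.

```typescript
interface NgramEntry {
  phrase: string;
  frequencyPerMillion: number;
  occurrences: number;
}

// Parse "phrase<TAB>frequency_per_million<TAB>occurrences" lines into a
// Map keyed by lowercased phrase. Blank or malformed lines are skipped.
function parseCorpusLines(text: string): Map<string, NgramEntry> {
  const index = new Map<string, NgramEntry>();
  for (const rawLine of text.split(/\r?\n/)) {
    if (rawLine.trim() === "") continue;
    const [phrase, freq, count] = rawLine.split("\t");
    const frequencyPerMillion = Number(freq);
    const occurrences = Number(count);
    if (!phrase || !isFinite(frequencyPerMillion) || !isFinite(occurrences)) {
      continue; // not a valid three-column row
    }
    const key = phrase.trim().toLowerCase();
    index.set(key, { phrase: key, frequencyPerMillion, occurrences });
  }
  return index;
}

// Usage with two rows from the example data above.
const sampleIndex = parseCorpusLines(
  "to be honest\t487.3\t1234\ni see your point\t156.4\t398\n"
);
const honest = sampleIndex.get("to be honest");
```

Keying the Map by lowercased phrase lets a lookup for "To be honest" from a task description match the lowercase corpus row.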

Step 2: Filter by CEFR + Function

•For each pragmatic function from task-designer
•Identify required chunks for that function
•Filter to chunks ≥100/million (high frequency)
•Validate against register (formal/informal)

Example (Soften Disagreement, B2):

code

✅ High frequency, suitable:
  - "To be honest" (487/million)
  - "I see your point, but" (156/million)
  - "That's true, however" (102/million)

❌ Too rare (exclude):
  - "It's not entirely paradoxical" (12/million)
  - "I must respectfully demur" (5/million)

❌ Too casual (context-dependent):
  - "Nah" (245/million - too informal for B2 professional)
  - "Whatever" (89/million - negative connotation)
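
The filtering logic in this step can be sketched as a function that records why each candidate was excluded. The type names, the `Register` labels, and the register assigned to each sample chunk are assumptions made for illustration; only the frequencies come from the example above.

```typescript
type Register = "formal" | "professional" | "neutral" | "informal";

interface Candidate {
  text: string;
  frequencyPerMillion: number;
  register: Register; // assumed to be assigned during manual review
}

interface FilterOutcome {
  included: Candidate[];
  excluded: { candidate: Candidate; reason: string }[];
}

// Keep candidates at or above the threshold whose register is allowed.
function filterCandidates(
  candidates: Candidate[],
  allowedRegisters: Register[],
  threshold = 100
): FilterOutcome {
  const outcome: FilterOutcome = { included: [], excluded: [] };
  for (const candidate of candidates) {
    if (candidate.frequencyPerMillion < threshold) {
      outcome.excluded.push({ candidate, reason: "too rare" });
    } else if (allowedRegisters.indexOf(candidate.register) === -1) {
      outcome.excluded.push({ candidate, reason: "register mismatch" });
    } else {
      outcome.included.push(candidate);
    }
  }
  return outcome;
}

const outcome = filterCandidates(
  [
    { text: "To be honest", frequencyPerMillion: 487, register: "neutral" },
    { text: "I must respectfully demur", frequencyPerMillion: 5, register: "formal" },
    { text: "Nah", frequencyPerMillion: 245, register: "informal" }
  ],
  ["professional", "neutral"]
);
```

Recording a reason per exclusion keeps the "too rare" and "too casual" cases distinguishable, as in the example lists above.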

Step 3: Validate Pragmatic Appropriateness

•Does the chunk actually perform the function?
•Would a native speaker use it in this context?
•Is it appropriate for the register and audience?

Check Matrix:

code

Chunk: "To be honest"
Function: Soften disagreement ✅
Context: Workplace negotiation ✅
Register: Professional/Friendly ✅
Authenticity: Native speaker would say ✅
Result: INCLUDE
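
One way to turn the check matrix into a decision is to require every check to pass. In this sketch the four boolean fields are assumed to come from human or upstream review; the code only combines them. `CheckMatrix`, `failedChecks`, and `decide` are hypothetical names.

```typescript
interface CheckMatrix {
  chunk: string;
  performsFunction: boolean; // Function row
  fitsContext: boolean;      // Context row
  fitsRegister: boolean;     // Register row
  authentic: boolean;        // Authenticity row
}

// List the names of the checks that did not pass.
function failedChecks(matrix: CheckMatrix): string[] {
  const failed: string[] = [];
  if (!matrix.performsFunction) failed.push("function");
  if (!matrix.fitsContext) failed.push("context");
  if (!matrix.fitsRegister) failed.push("register");
  if (!matrix.authentic) failed.push("authenticity");
  return failed;
}

// A chunk is included only when all four checks pass.
function decide(matrix: CheckMatrix): "INCLUDE" | "EXCLUDE" {
  return failedChecks(matrix).length === 0 ? "INCLUDE" : "EXCLUDE";
}

const toBeHonest: CheckMatrix = {
  chunk: "To be honest",
  performsFunction: true,
  fitsContext: true,
  fitsRegister: true,
  authentic: true
};
const decision = decide(toBeHonest);
```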

Step 4: Create Closed Set

•Compile all validated chunks
•Organize by pragmatic function
•Cap total at 200-300 chunks (manageable for learners)
•Ensure chunks transfer across contexts (not topic-specific)

Example Closed Set Structure:

json

{
  "soften_disagreement": [
    "To be honest",
    "I see your point, but",
    "That's true, however",
    "I hear you, and",
    "I understand, but"
  ],
  "hedge_opinion": [
    "Kind of",
    "Sort of",
    "In a way",
    "Relatively speaking",
    "To some extent"
  ],
  "repair_tension": [
    "I'm sorry, I didn't catch that",
    "Let me rephrase",
    "What I meant to say",
    "I think I wasn't clear"
  ]
}
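
The compile, organize, and cap steps could be implemented as below. The spec does not say how to choose which chunks survive the cap, so this sketch makes an assumption: it takes one chunk per function in turn, most frequent first, so that the cap trims depth before breadth. `ValidatedChunk` and `buildClosedSet` are hypothetical names.

```typescript
interface ValidatedChunk {
  text: string;
  pragmaticFunction: string;
  frequencyPerMillion: number;
}

function buildClosedSet(
  chunks: ValidatedChunk[],
  maxTotal = 300
): Record<string, string[]> {
  // Group by function, dropping duplicate texts within a function.
  const queues = new Map<string, ValidatedChunk[]>();
  const seen = new Set<string>();
  for (const chunk of chunks) {
    const key = chunk.pragmaticFunction + "|" + chunk.text.toLowerCase();
    if (seen.has(key)) continue;
    seen.add(key);
    const queue = queues.get(chunk.pragmaticFunction) || [];
    queue.push(chunk);
    queues.set(chunk.pragmaticFunction, queue);
  }
  // Most frequent first within each function.
  queues.forEach(queue =>
    queue.sort((a, b) => b.frequencyPerMillion - a.frequencyPerMillion)
  );

  // Round-robin: take the next chunk from each function until the cap.
  const closedSet: Record<string, string[]> = {};
  let total = 0;
  let round = 0;
  let added = true;
  while (added && total < maxTotal) {
    added = false;
    queues.forEach((queue, fn) => {
      const next = queue[round];
      if (total >= maxTotal || !next) return;
      const bucket = closedSet[fn] || [];
      bucket.push(next.text);
      closedSet[fn] = bucket;
      total++;
      added = true;
    });
    round++;
  }
  return closedSet;
}
```

A plain "keep the most frequent chunks overall" policy would be simpler, but it could starve low-frequency functions, which conflicts with the breadth-over-depth goal stated under Closed Set Curation.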

Step 5: Generate BUCKET_A Mapping

•Map chunks to difficulty levels (A2, B1, B2, C1)
•Organize by pedagogical priority
•Create BUCKET_A source for blank-inserter
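
A possible shape for the mapping step is sketched below. The key names (`a2_level`, `b1_level`, and so on) follow the `bucket_a_mapping` structure shown under Outputs; the type names and the use of corpus frequency as a stand-in for "pedagogical priority" are assumptions, since the spec does not define how priority is computed.

```typescript
type CefrLevel = "A2" | "B1" | "B2" | "C1";

interface LevelledChunk {
  text: string;
  level: CefrLevel;
  frequencyPerMillion: number;
}

interface Bucket {
  chunks: string[];
  count: number;
}

// Group chunks by CEFR level. Within a level, higher-frequency chunks come
// first (frequency is used here as an assumed proxy for priority).
function buildBucketA(chunks: LevelledChunk[]): Record<string, Bucket> {
  const levels: CefrLevel[] = ["A2", "B1", "B2", "C1"];
  const mapping: Record<string, Bucket> = {};
  for (const level of levels) {
    const texts = chunks
      .filter(c => c.level === level)
      .sort((a, b) => b.frequencyPerMillion - a.frequencyPerMillion)
      .map(c => c.text);
    mapping[level.toLowerCase() + "_level"] = { chunks: texts, count: texts.length };
  }
  return mapping;
}

const bucketA = buildBucketA([
  { text: "Let me rephrase", level: "B1", frequencyPerMillion: 134.7 },
  { text: "Kind of", level: "A2", frequencyPerMillion: 892.4 },
  { text: "What if we", level: "B2", frequencyPerMillion: 267.1 },
  { text: "To be honest", level: "B2", frequencyPerMillion: 487.3 }
]);
```

Levels with no chunks still get an empty bucket, so the consumer can rely on all four keys being present.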

Outputs

core_repertoire.json

json

{
  "metadata": {
    "date_generated": "2026-02-11",
    "corpus_sources": ["BNC Spoken", "COCA Spoken"],
    "cefr_levels": ["A2", "B1", "B2", "C1"],
    "total_chunks": 287,
    "validation_method": "Corpus frequency + pedagogical review"
  },

  "summary": {
    "chunk_categories": 12,
    "total_functions": 28,
    "frequency_threshold": 100,
    "register_focus": ["professional", "friendly", "neutral"],
    "variety": "British English (BNC primary source)"
  },

  "chunks_by_function": {
    "soften_disagreement": {
      "count": 18,
      "cefr_levels": ["B2", "C1"],
      "chunks": [
        {
          "text": "To be honest",
          "frequency_per_million": 487.3,
          "frequency_percentile": "Top 5%",
          "register": "neutral",
          "example_usage": "To be honest, I'm concerned about the timeline.",
          "validated": true,
          "validation_notes": "Native speakers use constantly, appropriate for all professional contexts"
        },
        {
          "text": "I see your point, but",
          "frequency_per_million": 156.4,
          "frequency_percentile": "Top 20%",
          "register": "professional",
          "example_usage": "I see your point, but we need to consider the risks.",
          "validated": true,
          "validation_notes": "Perfect for acknowledging before disagreeing"
        },
        {
          "text": "That's true, however",
          "frequency_per_million": 102.1,
          "frequency_percentile": "Top 25%",
          "register": "professional",
          "example_usage": "That's true, however, implementation will be challenging.",
          "validated": true,
          "validation_notes": "Slightly more formal, good for written and formal speech"
        }
      ]
    },

    "hedge_opinion": {
      "count": 15,
      "cefr_levels": ["A2", "B1", "B2"],
      "chunks": [
        {
          "text": "Kind of",
          "frequency_per_million": 892.4,
          "frequency_percentile": "Top 3%",
          "register": "informal",
          "example_usage": "It's kind of difficult to explain.",
          "validated": true,
          "validation_notes": "Very high frequency, appropriate for casual and semi-formal speech"
        }
      ]
    },

    "repair_tension": {
      "count": 12,
      "cefr_levels": ["B1", "B2"],
      "chunks": [
        {
          "text": "Let me rephrase",
          "frequency_per_million": 134.7,
          "frequency_percentile": "Top 22%",
          "register": "neutral",
          "example_usage": "I think I wasn't clear. Let me rephrase.",
          "validated": true,
          "validation_notes": "Essential repair marker, native speakers use frequently"
        }
      ]
    },

    "propose_alternative": {
      "count": 14,
      "cefr_levels": ["B1", "B2", "C1"],
      "chunks": [
        {
          "text": "What if we",
          "frequency_per_million": 267.1,
          "frequency_percentile": "Top 12%",
          "register": "neutral",
          "example_usage": "What if we tried a different approach?",
          "validated": true,
          "validation_notes": "Highly natural for suggestions and hypotheticals"
        }
      ]
    }
  },

  "bucket_a_mapping": {
    "description": "Core vocabulary for blank insertion (pedagogically prioritized)",
    "a2_level": {
      "chunks": ["Kind of", "Sort of", "I think", "What do you think"],
      "count": 24,
      "focus": "Recognition + basic hedging"
    },
    "b1_level": {
      "chunks": ["In a way", "Let me rephrase", "I hear you", "Sort of"],
      "count": 67,
      "focus": "Production + repair functions"
    },
    "b2_level": {
      "chunks": ["To be honest", "I see your point, but", "What if we", "Relatively speaking"],
      "count": 108,
      "focus": "Pragmatic functions + negotiation"
    },
    "c1_level": {
      "chunks": ["It might be worth considering", "There's merit in", "One could argue"],
      "count": 88,
      "focus": "Nuanced expression + subtext"
    }
  },

  "validation_checklist": {
    "frequency_above_threshold": {
      "status": "PASS",
      "all_chunks_above_100_per_million": true,
      "minimum_frequency_found": 102.1,
      "average_frequency": 267.4
    },
    "pragmatic_appropriateness": {
      "status": "PASS",
      "native_speaker_reviewed": true,
      "percentage_approved": 98.6
    },
    "register_consistency": {
      "status": "PASS",
      "registers_included": ["professional", "friendly", "neutral"],
      "no_slang_or_jargon": true
    },
    "cefr_alignment": {
      "status": "PASS",
      "all_levels_covered": true,
      "age_appropriate": true
    },
    "transferability": {
      "status": "PASS",
      "chunks_not_topic_specific": true,
      "reusable_across_contexts": true
    }
  }
}
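
A cheap sanity check on this output is that the per-level counts in `bucket_a_mapping` add up to `metadata.total_chunks`. In the example above they do: 24 + 67 + 108 + 88 = 287. The sketch below shows one way to automate that check; `BucketLevel` and `checkRepertoireCounts` are hypothetical names, and the function expects only the level entries, not the `description` string.

```typescript
interface BucketLevel {
  chunks: string[];
  count: number;
  focus: string;
}

interface CountCheck {
  bucketTotal: number;
  matches: boolean;
}

// Sum the per-level counts and compare them with the declared total.
function checkRepertoireCounts(
  totalChunks: number,
  levels: Record<string, BucketLevel>
): CountCheck {
  let bucketTotal = 0;
  for (const key of Object.keys(levels)) {
    const level = levels[key];
    if (level) bucketTotal += level.count;
  }
  return { bucketTotal, matches: bucketTotal === totalChunks };
}

// Counts copied from the example output; chunk lists are omitted for brevity.
const countCheck = checkRepertoireCounts(287, {
  a2_level: { chunks: [], count: 24, focus: "Recognition + basic hedging" },
  b1_level: { chunks: [], count: 67, focus: "Production + repair functions" },
  b2_level: { chunks: [], count: 108, focus: "Pragmatic functions + negotiation" },
  c1_level: { chunks: [], count: 88, focus: "Nuanced expression + subtext" }
});
```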

Success Criteria (Agent-Level Validation)

✅ Pass Conditions

•All chunks ≥100 per million frequency
  - ✅ "Kind of" (892/million)
  - ❌ "Perspicacious" (2/million - too rare)
•Chunks are pragmatically appropriate
  - ✅ "Let me rephrase" (used for repair)
  - ❌ "Actually, you're wrong" (inappropriate for softening)
•Register matches context
  - ✅ Professional negotiation uses "I see your point, but" (formal)
  - ❌ Professional negotiation uses "Nah" (too casual)
•Core repertoire is a closed set (200-300 chunks)
  - Too large: Learner overwhelmed, can't achieve mastery
  - Too small: Insufficient for authentic conversation
  - ✅ 287 chunks (balanced)
•Chunks transfer across contexts
  - ✅ "To be honest" works in workplace, casual, academic
  - ❌ "Move the deadline" (topic-specific, not universal)

❌ Fail Conditions

•Chunks <100 per million (too rare)
•Chunks are topic-specific (don't transfer)
•Register mismatch (slang in formal setting)
•Pragmatic function doesn't match usage
•Set is too large (>400 chunks) or too small (<150)

Validation Tests

Test 1: Frequency Verification

typescript

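// Checks that every chunk meets the 100-per-million threshold and
// summarizes the frequency distribution of the set.
function validateFrequency(chunks: Chunk[]): ValidationResult {
  const frequencies = chunks.map(c => c.frequency);
  return {
    allAboveThreshold: frequencies.every(f => f >= 100),
    // Math.min() of an empty list is Infinity, so guard the empty case.
    minFrequency: frequencies.length ? Math.min(...frequencies) : 0,
    avgFrequency: frequencies.length
      ? frequencies.reduce((sum, f) => sum + f, 0) / frequencies.length
      : 0,
    // calculatePercentile is assumed to be defined elsewhere.
    percentileDistribution: chunks.map(c => ({
      text: c.text,
      percentile: calculatePercentile(c.frequency)
    }))
  };
}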

Pass: All chunks ≥100/million, avg >200/million
Fail: Any chunk <100/million OR avg <120/million

Test 2: Pragmatic Appropriateness

typescript

function validatePragmaticMatch(
  chunks: Chunk[],
  functions: PragmaticFunction[]
): ValidationResult {
  return {
    eachFunctionHasChunks: functions.every(f => hasChunks(f)),
    chunksMatchFunction: chunks.every(c => matchesAssignedFunction(c)),
    nativeSpeakerReview: consultNativeSpeaker().approval >= 0.9
  }
}

Pass: Native speaker: "These chunks naturally perform the functions"
Fail: Native speaker: "Some chunks don't fit the function"

Test 3: Register Consistency

typescript

function validateRegister(
  chunks: Chunk[],
  targetRegister: string
): ValidationResult {
  return {
    registerConsistent: chunks.every(c => matchesRegister(c, targetRegister)),
    noSlangOrJargon: !chunks.some(c => isSlangOrJargon(c)),
    ageAppropriate: chunks.every(c => suitableForAge(c)),
    varietyAppropriate: chunks.every(c => isVariety(c, "british"))
  }
}

Pass: All chunks match register, no slang, appropriate for context
Fail: Register inconsistencies or inappropriate content

Test 4: Closed Set Validation

typescript

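function validateClosedSet(chunks: Chunk[]): ValidationResult {
  const count = chunks.length;
  return {
    // Optimal range matches the 200-300 cap used throughout this spec.
    sizeOptimal: count >= 200 && count <= 300,
    countDescription: `${count} chunks (optimal: 200-300)`,
    // Hard limits: the set fails only below 150 or above 400 chunks.
    learnerManageable: count <= 400,
    sufficient: count >= 150
  };
}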

Pass: 200-300 chunks (learner achievable, sufficient for authentic conversation)
Fail: <150 (insufficient) or >400 (overwhelming)

Test 5: Transferability Check

typescript

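function validateTransferability(chunks: Chunk[]): ValidationResult {
  const topicSpecific = chunks.filter(c => isTopicSpecific(c));
  const transferable = chunks.filter(c => !isTopicSpecific(c));
  const minTransferabilityRatio = 0.95; // 95%+ must transfer
  // An empty set has nothing to transfer, so report a ratio of 0.
  const transferabilityRatio = chunks.length
    ? transferable.length / chunks.length
    : 0;
  return {
    topicSpecific,
    transferable,
    transferabilityRatio,
    minTransferabilityRatio,
    passed: transferabilityRatio >= minTransferabilityRatio
  };
}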

Pass: ≥95% of chunks are non-topic-specific
Fail: Topic-specific chunks >5% of total

Integration with Other Agents

← From Task Designer

•Receives: pragmatic_functions + register + variety
•Uses for: Corpus lookup and validation

→ To Blank Inserter (via Settings)

•Sends: core_repertoire.json
•Sends as: BUCKET_A source (replaces hardcoded LOCKED_CHUNKS)
•Format: JSON mapping chunks to CEFR levels + functions

Corpus Data Format

BNC Spoken 2-Grams Example

code

to be	12456.7	31678
i think	8942.3	22701
kind of	892.4	2267
in a	7234.1	18376
what do	2341.5	5947
is it	3456.2	8778

BNC Spoken 3-Grams Example

code

to be honest	487.3	1234
kind of like	156.8	398
what do you	267.1	678
i see your	89.2	226
sort of like	134.5	341

Data Structure

json

{
  "phrase": "to be honest",
  "frequency": 487.3,
  "frequency_per_million": 487.3,
  "corpus": "BNC Spoken",
  "word_count": 3,
  "occurrences": 1234,
  "total_words_in_corpus": 10000000
}

Examples

Example 1: B2 Workplace Negotiation

Pragmatic Functions:

•Soften disagreement
•Propose alternative
•Repair tension

Curated Chunks (sample):

code

Soften:
- "To be honest" (487/million)
- "I see your point, but" (156/million)
- "That's true, however" (102/million)

Propose:
- "What if we" (267/million)
- "How about" (234/million)
- "Could we" (189/million)

Repair:
- "Let me rephrase" (135/million)
- "I didn't explain clearly" (98/million) ❌ Just below threshold, exclude

Example 2: A2 Restaurant Ordering

Pragmatic Functions:

•Politely request
•Acknowledge suggestion
•Thank provider

Curated Chunks (sample):

code

Request:
- "Can I have" (567/million)
- "Could I get" (234/million)

Acknowledge:
- "That sounds good" (345/million)
- "OK, thanks" (789/million)

Thank:
- "Thank you" (1234/million)
- "Thanks a lot" (567/million)
