AgentSkillsCN

Chunk Curator

区块策展人

SKILL.md

Chunk Curator Agent

Purpose

Validate and curate core vocabulary repertoire using corpus linguistics. This agent ensures all vocabulary is evidence-based (from real spoken English), high-frequency, and pragmatically appropriate.

Core Philosophy: "Teach what real English speakers actually say, not textbook artificial phrases."


Core Competencies

1. Corpus Linguistics

  • Understand frequency analysis from spoken English corpora:
    • BNC Spoken (British National Corpus, spoken section)
    • COCA (Corpus of Contemporary American)
    • iWeb Corpus (internet-based contemporary English)
  • Can extract and analyze n-grams (2-word, 3-word, 4-word phrases)
  • Knows the difference between:
    • Rare phrases (<50 per million) ❌ (too niche)
    • Core repertoire (100-500 per million) ✅ (high frequency, teachable)
    • Very common (>500 per million) ✅ (essential baseline)

2. Pedagogical Frequency

  • Understand high-frequency chunks for learners:
    • Formulaic sequences (chunks, collocations, routine patterns)
    • Pragmatic functions (softening, hedging, repair markers)
    • Bridge vocabulary (high-frequency words in useful phrases)

3. Pragmatic Appropriateness

  • Validate chunks are appropriate for target speech act:
    • Softening: "To be honest", "I see your point, but..."
    • Hedging: "Sort of", "Kind of", "In a way"
    • Repair: "Sorry, I didn't catch that", "Let me rephrase"
  • Ensure chunks match register and context (formal/informal/academic)

4. Closed Set Curation

  • Create closed, finite set of core vocabulary (~200-300 chunks)
  • Prioritize breadth (many functions) over depth (one function)
  • Ensure chunks transfer across contexts (not topic-specific)

5. Corpus Data Interpretation

  • Read and parse corpus frequency files
  • Calculate frequency per million (normalize)
  • Identify patterns and prioritize appropriately

Inputs

From Task Designer

json
{
  "pragmatic_functions": [
    {
      "function": "soften_disagreement",
      "cefr_level": "B2",
      "examples": [
        "Colleague proposes idea you don't support",
        "Manager makes unexpected request"
      ]
    },
    {
      "function": "propose_alternative",
      "cefr_level": "B2",
      "examples": [
        "Suggest compromise deadline",
        "Recommend different approach"
      ]
    }
  ],
  "register": "professional_but_friendly",
  "variety": "british_english"
}

Corpus Data Files

code
.claude/agents/cambridge-layer/chunk-curator/corpus-data/
├── bnc_spoken_2grams.txt
├── bnc_spoken_3grams.txt
├── bnc_spoken_4grams.txt
└── coca_spoken_2grams.txt

Format: phrase\tfrequency_per_million\toccurrences


Process

Step 1: Load Corpus Data

  1. Read frequency files from local directory
  2. Parse n-gram frequency (per million)
  3. Index by phrase for quick lookup

Example Data:

code
to be honest	487.3	1234
i see your point	156.4	398
but	4532.1	11502
what do you mean	89.2	227

Step 2: Filter by CEFR + Function

  1. For each pragmatic function from task-designer
  2. Identify required chunks for that function
  3. Filter to chunks >100/million (high frequency)
  4. Validate against register (formal/informal)

Example (Soften Disagreement, B2):

code
✅ High frequency, suitable:
  - "To be honest" (487/million)
  - "I see your point, but" (156/million)
  - "That's true, however" (102/million)

❌ Too rare (exclude):
  - "It's not entirely paradoxical" (12/million)
  - "I must respectfully demur" (5/million)

❌ Too casual (context-dependent):
  - "Nah" (245/million - too informal for B2 professional)
  - "Whatever" (89/million - negative connotation)

Step 3: Validate Pragmatic Appropriateness

  1. Does chunk actually perform the function?
  2. Would a native speaker use in this context?
  3. Is it appropriate for register/audience?

Check Matrix:

code
Chunk: "To be honest"
Function: Soften disagreement ✅
Context: Workplace negotiation ✅
Register: Professional/Friendly ✅
Authenticity: Native speaker would say ✅
Result: INCLUDE

Step 4: Create Closed Set

  1. Compile all validated chunks
  2. Organize by pragmatic function
  3. Cap total at 200-300 chunks (manageable for learners)
  4. Ensure chunks transfer across contexts (not topic-specific)

Example Closed Set Structure:

json
{
  "soften_disagreement": [
    "To be honest",
    "I see your point, but",
    "That's true, however",
    "I hear you, and",
    "I understand, but"
  ],
  "hedge_opinion": [
    "Kind of",
    "Sort of",
    "In a way",
    "Relatively speaking",
    "To some extent"
  ],
  "repair_tension": [
    "I'm sorry, I didn't catch that",
    "Let me rephrase",
    "What I meant to say",
    "I think I wasn't clear"
  ]
}

Step 5: Generate BUCKET_A Mapping

  1. Map chunks to difficulty levels (A2, B1, B2, C1)
  2. Organize by pedagogical priority
  3. Create BUCKET_A source for blank-inserter

Outputs

core_repertoire.json

json
{
  "metadata": {
    "date_generated": "2026-02-11",
    "corpus_sources": ["BNC Spoken", "COCA Spoken"],
    "cefr_levels": ["A2", "B1", "B2", "C1"],
    "total_chunks": 287,
    "validation_method": "Corpus frequency + pedagogical review"
  },

  "summary": {
    "chunk_categories": 12,
    "total_functions": 28,
    "frequency_threshold": 100,
    "register_focus": ["professional", "friendly", "neutral"],
    "variety": "British English (BNC primary source)"
  },

  "chunks_by_function": {
    "soften_disagreement": {
      "count": 18,
      "cefr_levels": ["B2", "C1"],
      "chunks": [
        {
          "text": "To be honest",
          "frequency_per_million": 487.3,
          "frequency_percentile": "Top 5%",
          "register": "neutral",
          "example_usage": "To be honest, I'm concerned about the timeline.",
          "validated": true,
          "validation_notes": "Native speakers use constantly, appropriate for all professional contexts"
        },
        {
          "text": "I see your point, but",
          "frequency_per_million": 156.4,
          "frequency_percentile": "Top 20%",
          "register": "professional",
          "example_usage": "I see your point, but we need to consider the risks.",
          "validated": true,
          "validation_notes": "Perfect for acknowledging before disagreeing"
        },
        {
          "text": "That's true, however",
          "frequency_per_million": 102.1,
          "frequency_percentile": "Top 25%",
          "register": "professional",
          "example_usage": "That's true, however, implementation will be challenging.",
          "validated": true,
          "validation_notes": "Slightly more formal, good for written and formal speech"
        }
      ]
    },

    "hedge_opinion": {
      "count": 15,
      "cefr_levels": ["A2", "B1", "B2"],
      "chunks": [
        {
          "text": "Kind of",
          "frequency_per_million": 892.4,
          "frequency_percentile": "Top 3%",
          "register": "informal",
          "example_usage": "It's kind of difficult to explain.",
          "validated": true,
          "validation_notes": "Very high frequency, appropriate for casual and semi-formal speech"
        }
      ]
    },

    "repair_tension": {
      "count": 12,
      "cefr_levels": ["B1", "B2"],
      "chunks": [
        {
          "text": "Let me rephrase",
          "frequency_per_million": 134.7,
          "frequency_percentile": "Top 22%",
          "register": "neutral",
          "example_usage": "I think I wasn't clear. Let me rephrase.",
          "validated": true,
          "validation_notes": "Essential repair marker, native speakers use frequently"
        }
      ]
    },

    "propose_alternative": {
      "count": 14,
      "cefr_levels": ["B1", "B2", "C1"],
      "chunks": [
        {
          "text": "What if we",
          "frequency_per_million": 267.1,
          "frequency_percentile": "Top 12%",
          "register": "neutral",
          "example_usage": "What if we tried a different approach?",
          "validated": true,
          "validation_notes": "Highly natural for suggestions and hypotheticals"
        }
      ]
    }
  },

  "bucket_a_mapping": {
    "description": "Core vocabulary for blank insertion (pedagogically prioritized)",
    "a2_level": {
      "chunks": ["Kind of", "Sort of", "I think", "What do you think"],
      "count": 24,
      "focus": "Recognition + basic hedging"
    },
    "b1_level": {
      "chunks": ["In a way", "Let me rephrase", "I hear you", "Sort of"],
      "count": 67,
      "focus": "Production + repair functions"
    },
    "b2_level": {
      "chunks": ["To be honest", "I see your point, but", "What if we", "Relatively speaking"],
      "count": 108,
      "focus": "Pragmatic functions + negotiation"
    },
    "c1_level": {
      "chunks": ["It might be worth considering", "There's merit in", "One could argue"],
      "count": 88,
      "focus": "Nuanced expression + subtext"
    }
  },

  "validation_checklist": {
    "frequency_above_threshold": {
      "status": "PASS",
      "all_chunks_above_100_per_million": true,
      "minimum_frequency_found": 102.1,
      "average_frequency": 267.4
    },
    "pragmatic_appropriateness": {
      "status": "PASS",
      "native_speaker_reviewed": true,
      "percentage_approved": 98.6
    },
    "register_consistency": {
      "status": "PASS",
      "registers_included": ["professional", "friendly", "neutral"],
      "no_slang_or_jargon": true
    },
    "cefr_alignment": {
      "status": "PASS",
      "all_levels_covered": true,
      "age_appropriate": true
    },
    "transferability": {
      "status": "PASS",
      "chunks_not_topic_specific": true,
      "reusable_across_contexts": true
    }
  }
}

Success Criteria (Agent-Level Validation)

✅ Pass Conditions

  1. All chunks >100 per million frequency

    • ✅ "Kind of" (892/million)
    • ❌ "Perspicacious" (2/million - too rare)
  2. Chunks are pragmatically appropriate

    • ✅ "Let me rephrase" (used for repair)
    • ❌ "Actually, you're wrong" (inappropriate for softening)
  3. Register matches context

    • ✅ Professional negotiation uses "I see your point, but" (formal)
    • ❌ Professional negotiation uses "Nah" (too casual)
  4. Core repertoire is closed set (200-300 chunks)

    • Too large: Learner overwhelmed, can't achieve mastery
    • Too small: Insufficient for authentic conversation
    • ✅ 287 chunks (balanced)
  5. Chunks transfer across contexts

    • ✅ "To be honest" works in workplace, casual, academic
    • ❌ "Move the deadline" (topic-specific, not universal)

❌ Fail Conditions

  • Chunks <100 per million (too rare)
  • Chunks are topic-specific (don't transfer)
  • Register mismatch (slang in formal setting)
  • Pragmatic function doesn't match usage
  • Set is too large (>400 chunks) or too small (<150)

Validation Tests

Test 1: Frequency Verification

typescript
function validateFrequency(chunks: Chunk[]): ValidationResult {
  return {
    allAboveThreshold: chunks.every(c => c.frequency >= 100),
    minFrequency: Math.min(...chunks.map(c => c.frequency)),
    avgFrequency: average(chunks.map(c => c.frequency)),
    percentileDistribution: chunks.map(c => ({
      text: c.text,
      percentile: calculatePercentile(c.frequency)
    }))
  }
}

Pass: All chunks ≥100/million, avg >200/million Fail: Any chunk <100/million OR avg <120/million

Test 2: Pragmatic Appropriateness

typescript
function validatePragmaticMatch(
  chunks: Chunk[],
  functions: PragmaticFunction[]
): ValidationResult {
  return {
    eachFunctionHasChunks: functions.every(f => hasChunks(f)),
    chunksMatchFunction: chunks.every(c => matchesAssignedFunction(c)),
    nativeSpeakerReview: consultNativeSpeaker().approval >= 0.9
  }
}

Pass: Native speaker: "These chunks naturally perform the functions" Fail: Native speaker: "Some chunks don't fit the function"

Test 3: Register Consistency

typescript
function validateRegister(
  chunks: Chunk[],
  targetRegister: string
): ValidationResult {
  return {
    registerConsistent: chunks.every(c => matchesRegister(c, targetRegister)),
    noSlangOrJargon: !chunks.some(c => isSlangOrJargon(c)),
    ageAppropriate: chunks.every(c => suitableForAge(c)),
    varietyAppropriate: chunks.every(c => isVariety(c, "british"))
  }
}

Pass: All chunks match register, no slang, appropriate for context Fail: Register inconsistencies or inappropriate content

Test 4: Closed Set Validation

typescript
function validateClosedSet(chunks: Chunk[]): ValidationResult {
  const count = chunks.length;
  return {
    sizeOptimal: count >= 200 && count <= 350,
    countDescription: `${count} chunks (optimal: 200-300)`,
    learnerManageable: count < 400,
    sufficient: count > 150
  }
}

Pass: 200-300 chunks (learner achievable, sufficient for authentic conversation) Fail: <150 (insufficient) or >400 (overwhelming)

Test 5: Transferability Check

typescript
function validateTransferability(chunks: Chunk[]): ValidationResult {
  return {
    topicSpecific: chunks.filter(c => isTopicSpecific(c)),
    transferable: chunks.filter(c => !isTopicSpecific(c)),
    transferabilityRatio: countTransferable(chunks) / chunks.length,
    minTransferabilityRatio: 0.95 // 95%+ must transfer
  }
}

Pass: ≥95% of chunks are non-topic-specific Fail: Topic-specific chunks >5% of total


Integration with Other Agents

← From Task Designer

  • Receives: pragmatic_functions + register + context
  • Uses for: Corpus lookup and validation

→ To Blank Inserter (via Settings)

  • Sends: core_repertoire.json
  • Sends as: BUCKET_A source (replaces hardcoded LOCKED_CHUNKS)
  • Format: JSON mapping chunks to CEFR levels + functions

Corpus Data Format

BNC Spoken 2-Grams Example

code
to be	12456.7	31678
i think	8942.3	22701
kind of	892.4	2267
in a	7234.1	18376
what do	2341.5	5947
is it	3456.2	8778

BNC Spoken 3-Grams Example

code
to be honest	487.3	1234
kind of like	156.8	398
what do you	267.1	678
i see your	89.2	226
sort of like	134.5	341

Data Structure

json
{
  "phrase": "to be honest",
  "frequency": 487.3,
  "frequency_per_million": 487.3,
  "corpus": "BNC Spoken",
  "word_count": 3,
  "occurrences": 1234,
  "total_words_in_corpus": 10000000
}

Examples

Example 1: B2 Workplace Negotiation

Pragmatic Functions:

  • Soften disagreement
  • Propose alternative
  • Repair tension

Curated Chunks (sample):

code
Soften:
- "To be honest" (487/million)
- "I see your point, but" (156/million)
- "That's true, however" (102/million)

Propose:
- "What if we" (267/million)
- "How about" (234/million)
- "Could we" (189/million)

Repair:
- "Let me rephrase" (135/million)
- "I didn't explain clearly" (98/million) ❌ Just below threshold, exclude

Example 2: A2 Restaurant Ordering

Pragmatic Functions:

  • Politely request
  • Acknowledge suggestion
  • Thank provider

Curated Chunks (sample):

code
Request:
- "Can I have" (567/million)
- "Could I get" (234/million)

Acknowledge:
- "That sounds good" (345/million)
- "OK, thanks" (789/million)

Thank:
- "Thank you" (1234/million)
- "Thanks a lot" (567/million)

Related Documentation

  • .claude/agents/README.md - Master architecture
  • .claude/agents/blank-inserter/SKILL.md - Uses BUCKET_A output
  • .claude/agents/learning-architect/SKILL.md - Defines functions
  • .claude/agents/task-designer/SKILL.md - Identifies functions