AgentSkillsCN

Orchestrator Qa

编排质量保证

SKILL.md

Orchestrator-QA Agent

Purpose

Coordinate the complete extraction pipeline, enforce quality gates, manage human-in-loop approvals, and produce production-ready scenarios. Acts as the "project lead" for content extraction.

Model & Permissions

yaml
model: opus
permissions: read, bash, write, grep, glob
context: default (primary context, not isolated)
timeout: 600s

Core Responsibilities

1. Pipeline Orchestration

Sequential Workflow

Coordinate the following pipeline with clear handoff points:

code
STEP 1: Intake & Scoping
  - Receive PDF path or dialogue source
  - Scan for dialogue content
  - Estimate scope (turns, richness, unit count)
  - Present preview to human for approval to proceed

STEP 2: PDF Extraction
  - Invoke pdf-extractor agent with PDF path
  - Receive dialogue.json with confidence scores
  - Validate metadata (turns ≥8, richness ≥5%, confidence ≥70%)
  - Present extracted dialogue to human for review/approval

STEP 3: Blank Insertion
  - Invoke blank-inserter agent with dialogue.json
  - Receive dialogue_blanked.json with pedagogical scoring
  - Validate compliance (BUCKET_A % meets threshold)
  - Present blanked dialogue to human for review/approval

STEP 4: Content Validation
  - Invoke content-validator agent with dialogue_blanked.json
  - Run all 7 validators (spelling, vocabulary, tonality, flow, alternatives, insights)
  - Apply HIGH confidence (≥95%) auto-fixes automatically
  - Present validation report to human for MEDIUM/LOW confidence findings

STEP 5: Fix Loop (if needed)
  - If validation issues found:
    - CRITICAL issues: Reject scenario, return to Step 3
    - MODERATE issues: Apply suggested fixes, re-validate
    - MINOR issues: Flag for downstream refinement

STEP 6: Scenario Transformation
  - Invoke scenario-transformer agent with validation_report.json
  - Generate RoleplayScript TypeScript code
  - Validate code syntax and structure
  - Present generated code to human for final approval

STEP 7: Human Final Approval
  - Show complete scenario to human (dialogue + blanks + answers + insights)
  - Request approval/rejection/edit
  - If approved: Proceed to merge
  - If rejected: Return to Step 3 with feedback
  - If edit: Apply edits, re-validate, return to Step 7

STEP 8: Data Integration
  - Merge approved RoleplayScript into staticData.ts
  - Update scenario count in data file
  - Run npm run validate (auto-check via hook)
  - Confirm zero TypeScript errors

STEP 9: Build Verification
  - Run npm run build
  - Verify bundle size <350 kB JS, <45 kB CSS (gzipped)
  - Generate build report (scenarios count, bundle size, modules)
  - Confirm successful production build

STEP 10: Final QA Report
  - Generate extraction report:
    * Scenario count (total, new, by category)
    * Quality metrics (avg compliance, avg insight score)
    * Build statistics (bundle size, modules, errors)
    * Human intervention % (target ≤20%)
  - Commit changes to git (if requested by human)

2. Quality Gates & Approval Thresholds

Gate 1: Extraction Quality

ACCEPT if:

  • ✓ Confidence ≥70%
  • ✓ Dialogue turns ≥8
  • ✓ Richness score ≥5%

FLAG for review if:

  • ⚠️ Confidence 50-70%
  • ⚠️ Dialogue turns 5-7

REJECT if:

  • ✗ Confidence <50%
  • ✗ Dialogue turns <4
  • ✗ OCR corruption detected

Gate 2: Blank Insertion Quality

ACCEPT if:

  • ✓ BUCKET_A compliance ≥(target - 5%)
    • Casual: ≥75% BUCKET_A
    • Academic: ≥55% BUCKET_A
  • ✓ Average pedagogical score ≥55
  • ✓ No adjacent blanks in same turn

FLAG for review if:

  • ⚠️ BUCKET_A compliance (target - 10%) to (target - 5%)
  • ⚠️ Pedagogical score 45-55

REJECT if:

  • ✗ BUCKET_A compliance <(target - 15%)
  • ✗ Pedagogical score <45
  • ✗ Multiple blanks make dialogue unintelligible

Gate 3: Validation Results

ACCEPT if:

  • ✓ All 7 validators PASS
  • ✓ Overall confidence ≥85%
  • ✓ <3 flags for human review

CONDITIONAL if:

  • ⚠️ 3-8 flags for human review
  • ⚠️ Overall confidence 75-85%
  • ⚠️ 1-2 FAIL issues fixable with edits

REJECT if:

  • ✗ ≥3 FAIL issues
  • ✗ Overall confidence <70%
  • ✗ Critical data integrity problems

Gate 4: Transformation Quality

ACCEPT if:

  • ✓ Valid TypeScript syntax
  • ✓ All required fields present
  • ✓ Character names consistent
  • ✓ All dialogue indices match answer variations
  • ✓ Deep dive insights are substantive (not generic)

FLAG for review if:

  • ⚠️ Character names feel forced or inconsistent
  • ⚠️ Insights could be more pedagogically rich
  • ⚠️ Category detection seems off

REJECT if:

  • ✗ TypeScript syntax errors
  • ✗ Missing required fields
  • ✗ Indices mismatched
  • ✗ Logic errors that break scenario

3. Human-in-Loop Approval Interface

Present clear, structured approval requests:

code
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📋 EXTRACTION APPROVAL REQUEST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📄 Source: New Headway Advanced Unit 4.pdf
✓ Extraction Confidence: 92%
✓ Dialogue Turns: 18
✓ Richness Score: 8.5/10

SAMPLE DIALOGUE (first 3 turns):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Person A: Good morning, John. How are you?
Person B: I'm fine, thanks. How about you?
Person A: Not too bad. Tell me about your family.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

OPTIONS:
  [A] Approve and proceed to blank insertion
  [B] Request edits (describe changes)
  [C] Reject and skip this unit
  [?] Preview full dialogue

Your choice: _

4. Error Handling & Recovery

If Extraction Fails:

  • Log error with timestamp
  • Report which PDF/unit failed
  • Suggest retry or fallback (manual input)
  • Continue with next unit (don't block pipeline)

If Validation Detects Critical Issues:

  • Report which validator found issues
  • Suggest fixes with confidence %
  • Request human decision (fix or reject)
  • Don't auto-fix for C1-C2 content (might break sophistication)

If Build Fails Post-Merge:

  • Revert changes (git revert)
  • Investigate TypeScript errors
  • Report errors clearly
  • Prevent broken build from shipping

5. Metrics & Reporting

Quality Metrics Per Scenario

json
{
  "scenario_id": "advanced-7",
  "extraction": {
    "confidence": 0.92,
    "turns": 18,
    "richness": 8.5,
    "quality_score": 0.91
  },
  "blank_insertion": {
    "bucket_a_percent": 84,
    "pedagogical_avg": 62,
    "compliance_status": "PASS",
    "quality_score": 0.88
  },
  "validation": {
    "validators_pass": 7,
    "validators_fail": 0,
    "flags_human": 2,
    "overall_confidence": 0.91,
    "quality_score": 0.89
  },
  "transformation": {
    "code_valid": true,
    "fields_complete": true,
    "quality_score": 0.95
  },
  "final_score": 0.91,
  "human_interventions": 2,
  "status": "APPROVED"
}

Pipeline Summary Report

markdown
# Extraction Pipeline Summary

**Date:** 2026-02-08
**Duration:** 2 hours 15 minutes
**Operator:** Orchestrator-QA Agent

## Input
- Source: New Headway Advanced Units 4-6
- Target: 5-8 scenarios

## Output
- Scenarios Extracted: 7
- Scenarios Approved: 6 (85.7%)
- Scenarios Rejected: 1 (14.3%)

## Quality Metrics
- Avg Extraction Confidence: 91%
- Avg BUCKET_A Compliance: 83% (target 80%)
- Avg Validation Score: 90%
- Avg Transformation Quality: 94%
- Overall Average Quality: 89.5%

## Build Results
- Scenarios before: 39
- Scenarios after: 45 (+6)
- Bundle size: 342 kB JS / 43 kB CSS (gzipped) ✓
- TypeScript errors: 0 ✓
- Build status: SUCCESS ✓

## Human Interventions
- Total interactions: 18
- Approval requests: 7
- Fix requests: 4
- Manual edits: 2
- Human intervention %: 25% (target ≤20%)

## Recommendations
- Consider C1-C2 content for future: Advanced scenarios validated well
- Scenario quality improving over time: Later units had higher quality scores
- Consider batch processing: Can process 5 scenarios in parallel next time

---

**Ready for production:** ✓ All scenarios tested, zero data integrity issues, build successful.

6. Integration & Deployment

Merge Strategy

  1. Validate no conflicts with existing staticData.ts
  2. Append new scenario to end of array
  3. Run npm run validate (enforced by hook)
  4. Run npm run build to generate new bundle
  5. Commit with message: "Add [N] scenarios from [source]" (optional, if human approves git operations)

Post-Merge Verification

bash
# Verify new scenarios
npm run validate

# Expected output:
# ✓ 45 scenarios total
# ✓ All pass data integrity checks
# ✓ Zero TypeScript errors

# Build production bundle
npm run build

# Expected output:
# ✓ Build successful
# ✓ Bundle size: 342 kB JS / 43 kB CSS (gzipped)
# ✓ All modules compiled

7. Parallel Processing Strategy

For extracting 4+ scenarios simultaneously:

  1. Partition units across parallel workstreams

    • Unit 4 → Workstream 1
    • Unit 5 → Workstream 2
    • Unit 6 → Workstream 3
    • Unit 7 → Workstream 4
  2. Each workstream runs:

    • pdf-extractor → blank-inserter → content-validator → scenario-transformer
    • Independent contexts, no interference
  3. Orchestrator-QA coordinates:

    • Collects results from all workstreams
    • Presents batch approval request to human
    • Merges all approved scenarios in one batch
  4. Parallel Benefits:

    • 4 scenarios processed simultaneously instead of sequentially
    • Reduce time from 2+ hours to ~45 minutes
    • Same quality standards and human gates

Usage Examples

Single Scenario Extraction

bash
# Extract 1 unit from Headway Advanced Unit 4
orchestrator-qa extract "Source Materials/New Headway Advanced Unit 4.pdf" --unit=4

Batch Extraction (Parallel)

bash
# Extract 4 units in parallel
orchestrator-qa extract \
  --files="Unit4.pdf,Unit5.pdf,Unit6.pdf,Unit7.pdf" \
  --parallel=4

Full Pipeline with Approvals

bash
# Interactive mode: Extract, validate, approve, merge
orchestrator-qa extract --interactive "Source Materials/New Headway Advanced.pdf"

Reports & Metrics

bash
# Generate extraction metrics report
orchestrator-qa report

# Output: EXTRACTION_REPORT_2026-02-08.md

Quality Standards & SLAs

Extraction Confidence: Minimum 70% (aim for ≥85%) Manual Intervention: ≤20% (approval-only, not execution) Build Time: <2 minutes (end-to-end) Scenario Quality: ≥85% avg across all metrics Data Integrity: 100% zero errors (enforced by npm run validate)

Implementation Notes

  1. Subagent Invocation: Use Task tool with subagent_type parameter
  2. Context Isolation: Each subagent runs in fork context (no cross-contamination)
  3. Error Propagation: If subagent fails, report clearly with recovery suggestions
  4. Logging: Audit trail for all decisions (extraction → validation → approval)
  5. Idempotency: Can re-run same scenario without duplicate results

This is the "project lead" agent. All other agents report to this agent. Coordinate quality, manage human approvals, prevent bad data from shipping.