SKILL.md
---
name: model-certification
description: "Guides end-to-end HuggingFace model certification: playbook creation from templates, running qualification at any tier (Smoke/MVP/Quick/Standard/Deep), collecting evidence, computing MQS scores with G0-G4 gateways, updating models.csv, and syncing the README certification table. Covers SafeTensors ground truth, LAYOUT-002 compliance, and Popperian falsification methodology."
disable-model-invocation: false
user-invocable: true
allowed-tools: "Read, Grep, Glob, Bash"
argument-hint: "model or family: qwen-coder, llama, starcoder, mistral, phi, deepseek, or tier: smoke, mvp, quick, standard, deep"
---

Model Certification

This skill guides the full qualification pipeline for HuggingFace models using Popperian falsification and Toyota Production System principles.

Quick Start

Certify a Model (Recommended Path)

```bash
# 1. Pick a tier
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp

# 2. Or run a specific playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml \
    --output certifications/qwen2.5-coder-1.5b/evidence.json

# 3. Update certification registry
make update-certifications
```

Makefile Shortcuts

```bash
make certify-smoke      # Tier 1: ~1-2 min, minimal sanity
make certify-mvp        # Tier 2: ~5-10 min, full surface coverage
make certify-quick      # Tier 3: ~10-30 min, balanced (default)
make certify-standard   # Tier 4: ~1-2 hr, extended matrix
make certify-deep       # Tier 5: ~8-24 hr, production certification
make certify-qwen       # Priority: Qwen Coder family
make ci-smoke           # CI: 1.5B safetensors CPU only
make nightly-7b         # Nightly: 7B MVP qualification
```

Certification Tiers

| Tier | Time | Test Matrix | Playbook Suffix | Pass Threshold |
|------|------|-------------|-----------------|----------------|
| Smoke | ~1-2 min | safetensors/cpu/run only | `-smoke` | MQS >= 700 |
| MVP | ~5-10 min | 3 formats x 2 backends x 3 modalities = 18 | `-mvp` | MQS >= 700 |
| Quick | ~10-30 min | Balanced coverage, 10+ scenarios | (none) | MQS >= 700 |
| Standard | ~1-2 hr | Extended matrix, 170+ data points | (none) | MQS >= 700 |
| Deep | ~8-24 hr | Full matrix, 1800+ tests | (none) | MQS >= 900 |

Tier Selection Guide

  • Smoke: "Does it load at all?" Quick regression check.
  • MVP: "Does it work across all format/backend/modality combos?" Pre-release gate.
  • Quick: "Is it stable enough for development?" CI/CD pipeline default.
  • Standard: "Is it production-viable?" Extended stress testing.
  • Deep: "Full production certification." Comprehensive falsification.
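
The tier table can be read as a lookup from tier name to pass threshold. The following is a minimal sketch, assuming the thresholds are compared against the raw 0-1000 MQS and that a failed gateway always fails the tier. `TIER_THRESHOLDS` and `passes_tier` are hypothetical names for illustration, not part of apr-qa.

```python
# Pass thresholds copied from the Certification Tiers table (raw MQS, 0-1000).
TIER_THRESHOLDS = {
    "smoke": 700,
    "mvp": 700,
    "quick": 700,
    "standard": 700,
    "deep": 900,
}


def passes_tier(tier, mqs, gateways_passed):
    """Return True when a run meets the tier's threshold (illustrative only)."""
    if tier not in TIER_THRESHOLDS:
        raise ValueError(f"unknown tier: {tier!r}")
    # A gateway failure zeroes the score, so such a run can never pass.
    return gateways_passed and mqs >= TIER_THRESHOLDS[tier]


if __name__ == "__main__":
    for tier in TIER_THRESHOLDS:
        print(tier, passes_tier(tier, 812, True))
```

With a score of 812, this sketch accepts every tier except Deep, which is the only tier whose threshold differs.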

Pipeline Steps

Step 1: Create a Playbook

Start from a template:

```bash
# Copy the MVP template
cp playbooks/templates/mvp.yaml \
   playbooks/models/my-model-1.5b-mvp.playbook.yaml
```

Edit the playbook with model-specific values. See playbook-anatomy.md for field reference.
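
The playbook paths used in this document appear to follow the pattern `playbooks/models/<model><suffix>.playbook.yaml`, with the suffix taken from the tier table. The sketch below encodes that reading. It is an assumption inferred from the examples, in particular for tiers whose suffix is listed as "(none)", and `playbook_path` is a hypothetical helper rather than an apr-qa function.

```python
from pathlib import PurePosixPath

# Suffixes from the Certification Tiers table; "(none)" is modeled as "".
TIER_SUFFIX = {
    "smoke": "-smoke",
    "mvp": "-mvp",
    "quick": "",
    "standard": "",
    "deep": "",
}


def playbook_path(model_slug, tier):
    """Build the expected playbook path for a model slug and tier (assumed layout)."""
    if tier not in TIER_SUFFIX:
        raise ValueError(f"unknown tier: {tier!r}")
    filename = f"{model_slug}{TIER_SUFFIX[tier]}.playbook.yaml"
    return PurePosixPath("playbooks/models") / filename


if __name__ == "__main__":
    print(playbook_path("qwen2.5-coder-1.5b", "mvp"))
```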

Available Templates:

| Template | Matrix Size | Use Case |
|----------|-------------|----------|
| mvp.yaml | 18 tests (3x2x3) | Full surface coverage |
| quick-check.yaml | 10 tests (1x1x1x10) | Fast sanity check |
| basic-verify.yaml | 9 tests (3x1x1x3) | Format comparison |
| ci-pipeline.yaml | 225 tests (3x1x3x25) | CI/CD gate |
| full-qualification.yaml | 1800 tests (3x2x3x100) | Production certification |
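
Each matrix size is the product of its dimension counts. The short check below reproduces the arithmetic in the table. The dimension tuples are copied from the table, while `TEMPLATE_DIMENSIONS` and `matrix_size` are illustrative names. The source does not say what the fourth factor represents, so the sketch treats it as an opaque count.

```python
from math import prod

# Dimension counts copied from the Available Templates table.
TEMPLATE_DIMENSIONS = {
    "mvp.yaml": (3, 2, 3),
    "quick-check.yaml": (1, 1, 1, 10),
    "basic-verify.yaml": (3, 1, 1, 3),
    "ci-pipeline.yaml": (3, 1, 3, 25),
    "full-qualification.yaml": (3, 2, 3, 100),
}


def matrix_size(dimensions):
    """Number of test cells in the full cross product of the dimensions."""
    return prod(dimensions)


if __name__ == "__main__":
    for name, dims in TEMPLATE_DIMENSIONS.items():
        print(f"{name}: {matrix_size(dims)} tests")
```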

Step 2: Run Certification

```bash
# By family and tier (auto-finds playbook)
cargo run --bin apr-qa -- certify \
    --family qwen-coder \
    --tier mvp \
    --model-cache ~/.cache/apr/models

# By specific playbook
cargo run --bin apr-qa -- run \
    playbooks/models/qwen2.5-coder-1.5b-mvp.playbook.yaml \
    --output certifications/qwen2.5-coder-1.5b/ \
    --failure-policy collect-all

# Dry run first (see what would execute)
cargo run --bin apr-qa -- certify --family qwen-coder --tier mvp --dry-run
```

Key CLI Options:

| Option | Default | Description |
|--------|---------|-------------|
| `--tier` | quick | Certification tier |
| `--failure-policy` | stop-on-p0 | One of stop-on-first, stop-on-p0, collect-all, fail-fast |
| `--workers` | 4 | Parallel test workers |
| `--timeout` | 60000 | Per-test timeout (ms) |
| `--no-gpu` | false | CPU-only mode |
| `--model-cache` | - | Model file directory |
| `--apr-binary` | apr | Path to APR binary |
| `--dry-run` | false | Show plan without executing |
| `--fail-fast` | false | Stop + enhanced diagnostics |

Failure Policies (Jidoka):

| Policy | Behavior | When to Use |
|--------|----------|-------------|
| stop-on-first | Halt on any failure | Debugging a specific issue |
| stop-on-p0 | Halt on critical (G0-G4) failure, continue on others | Default for most runs |
| collect-all | Run everything, report all failures | MVP/certification runs |
| fail-fast | Halt + emit enhanced tracing diagnostics | Deep debugging |
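
The difference between the policies is easiest to see as a halting rule applied to a sequence of test outcomes. This is a rough sketch of the semantics described in the table, not the apr-qa runner: `Outcome` and `run_with_policy` are hypothetical, and the extra diagnostics that `fail-fast` emits are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    name: str
    passed: bool
    is_p0: bool = False  # True for critical (gateway-level) failures


POLICIES = ("stop-on-first", "stop-on-p0", "collect-all", "fail-fast")


def run_with_policy(outcomes, policy="stop-on-p0"):
    """Return the outcomes recorded before the run halts under the given policy."""
    if policy not in POLICIES:
        raise ValueError(f"unknown failure policy: {policy!r}")
    recorded = []
    for outcome in outcomes:
        recorded.append(outcome)
        if outcome.passed:
            continue
        if policy in ("stop-on-first", "fail-fast"):
            break  # halt on any failure
        if policy == "stop-on-p0" and outcome.is_p0:
            break  # halt only on critical failures
        # collect-all (and non-critical failures under stop-on-p0): keep going
    return recorded


if __name__ == "__main__":
    sample = [
        Outcome("load", True),
        Outcome("perf", False),
        Outcome("inference", False, is_p0=True),
        Outcome("edge", True),
    ]
    for policy in POLICIES:
        print(policy, [o.name for o in run_with_policy(sample, policy)])
```

Under this model, `collect-all` is the only policy that always reports every outcome, which is why it suits certification runs where the complete failure picture matters.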

Step 3: Review Evidence

```bash
# Score the evidence
cargo run --bin apr-qa -- score certifications/my-model/evidence.json

# Generate reports (HTML + JUnit XML + Markdown)
cargo run --bin apr-qa -- report certifications/my-model/evidence.json \
    --output certifications/my-model/ \
    --formats all
```

Step 4: Export to Registry

```bash
# Export evidence to certification CSV
cargo run --bin apr-qa -- export-csv \
    --evidence-dir docs/certifications/evidence \
    --output docs/certifications/models.csv

# Sync README table from CSV
make update-certifications
```

Step 5: Validate Contract Compliance

```bash
# Validate tensor layout contract
cargo run --bin apr-qa -- validate-contract /path/to/model.gguf

# Run all 5 conversion invariants
./scripts/diagnose-conversion.sh /path/to/model.gguf
```

MQS Scoring System (0-1000)

Score Calculation

Six categories, 1000 raw points total:

| Category | Code | Max Points | What It Measures |
|----------|------|------------|------------------|
| Quality | QUAL | 200 | Basic quality, loads, responds |
| Performance | PERF | 150 | Throughput, latency metrics |
| Stability | STAB | 200 | Stability under stress |
| Compatibility | COMP | 150 | Format/backend coverage |
| Edge Cases | EDGE | 150 | Edge case handling |
| Regression | REGR | 150 | Regression resistance |

Penalties:

  • Crash: -20 points each
  • Timeout: -10 points each
  • Gateway failure: -1000 (zeroes entire score)

Normalization: logarithmic scaling f(x) = 100 * log(1 + 9x) / log(10), with x = raw / 1000, maps the raw 0-1000 score to a normalized 0-100 score (f(0) = 0, f(1) = 100).
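
The scoring rules above can be sketched in a few lines. This is not the apr-qa implementation. It assumes that category points are capped at their maxima, that the penalized total is clamped to the 0-1000 range, and that the normalization input is x = raw / 1000. `raw_mqs` and `normalize` are hypothetical names.

```python
import math

# Category maxima copied from the Score Calculation table.
CATEGORY_MAX = {
    "QUAL": 200, "PERF": 150, "STAB": 200,
    "COMP": 150, "EDGE": 150, "REGR": 150,
}


def raw_mqs(category_points, crashes=0, timeouts=0, gateways_passed=True):
    """Raw 0-1000 score: capped category points minus crash/timeout penalties."""
    if not gateways_passed:
        return 0  # a gateway failure zeroes the entire score
    total = 0
    for code, maximum in CATEGORY_MAX.items():
        points = category_points.get(code, 0)
        total += max(0, min(points, maximum))  # assumption: cap per category
    total -= 20 * crashes + 10 * timeouts
    return max(0, min(total, 1000))  # assumption: clamp to the valid range


def normalize(raw):
    """Map a raw 0-1000 score to 0-100 via f(x) = 100 * log(1 + 9x) / log(10)."""
    x = raw / 1000
    return 100 * math.log(1 + 9 * x) / math.log(10)


if __name__ == "__main__":
    raw = raw_mqs(dict(CATEGORY_MAX), crashes=2, timeouts=1)
    print(raw, round(normalize(raw), 2))
```

Because the curve is concave, the normalized value rises quickly at low raw scores and flattens near the top of the range.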

Grade Mapping

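| Normalized | Grade | Status |
|------------|-------|--------|
| >= 97 | A+ | CERTIFIED |
| >= 93 | A | CERTIFIED |
| >= 90 | A- | CERTIFIED |
| >= 83 | B | PROVISIONAL |
| >= 70 | C | Qualifies |
| >= 60 | D | Below threshold |
| < 60 | F | BLOCKED |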

Qualification Thresholds

  • qualifies(): gateways passed AND normalized >= 70
  • is_production_ready(): gateways passed AND normalized >= 90
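
The grade bands and the two qualification predicates translate directly into a lookup and two comparisons. In the sketch below, the function names `qualifies` and `is_production_ready` mirror the ones in the text, but the Python bodies are stand-ins written for illustration, and only the seven bands listed in the Grade Mapping table are modeled.

```python
# (floor, grade, status) rows copied from the Grade Mapping table, highest first.
GRADE_BANDS = [
    (97, "A+", "CERTIFIED"),
    (93, "A", "CERTIFIED"),
    (90, "A-", "CERTIFIED"),
    (83, "B", "PROVISIONAL"),
    (70, "C", "Qualifies"),
    (60, "D", "Below threshold"),
]


def grade(normalized):
    """Return (grade, status) for a normalized 0-100 score."""
    for floor, letter, status in GRADE_BANDS:
        if normalized >= floor:
            return letter, status
    return "F", "BLOCKED"


def qualifies(normalized, gateways_passed):
    return gateways_passed and normalized >= 70


def is_production_ready(normalized, gateways_passed):
    return gateways_passed and normalized >= 90


if __name__ == "__main__":
    for score in (98.0, 91.5, 86.3, 72.0, 41.0):
        print(score, grade(score), qualifies(score, True), is_production_ready(score, True))
```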

Gateway System (G0-G4)

A failure at any gateway sets the entire MQS score to 0.

| Gate | Name | What It Checks | Common Failure |
|------|------|----------------|----------------|
| G0 | Integrity | config.json matches tensor metadata | Corrupted config, wrong tensor count |
| G1 | Load | Model loads without errors | Missing files, bad format, OOM |
| G2 | Inference | Basic inference produces output | Timeout, crash during forward pass |
| G3 | Stability | No crashes, panics, or segfaults | LAYOUT-002 violations, null pointers |
| G4 | Quality | Output is not garbage | Repetitive patterns, NaN/Inf, encoding errors |

See gateway-diagnostics.md for failure diagnosis.
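
One way to picture the gateway rule is as an ordered check that short-circuits the score. The sketch below is illustrative only: `first_failed_gateway` and `gated_score` are hypothetical helpers, and treating a missing gate result as a failure is an assumption, not documented behavior.

```python
# Gate IDs and names copied from the gateway table, in evaluation order.
GATEWAYS = [
    ("G0", "Integrity"),
    ("G1", "Load"),
    ("G2", "Inference"),
    ("G3", "Stability"),
    ("G4", "Quality"),
]


def first_failed_gateway(results):
    """Return the ID of the first failing gate, or None when all pass.

    `results` maps gate ID -> bool; a missing entry counts as a failure
    (assumption made for this sketch).
    """
    for gate_id, _name in GATEWAYS:
        if not results.get(gate_id, False):
            return gate_id
    return None


def gated_score(raw_score, results):
    """Zero the score when any gateway fails, otherwise pass it through."""
    return 0 if first_failed_gateway(results) is not None else raw_score


if __name__ == "__main__":
    all_pass = {gate_id: True for gate_id, _ in GATEWAYS}
    print(gated_score(912, all_pass))
    print(gated_score(912, {**all_pass, "G4": False}), first_failed_gateway({**all_pass, "G4": False}))
```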

Format Hierarchy

```text
SafeTensors (Ground Truth)
    |
    +-- APR (Native optimized, converted from SafeTensors)
    |
    +-- GGUF (Third-party, MUST be converted via aprender)
```

SafeTensors is always the source of truth. GGUF uses column-major layout (GGML convention); aprender transposes during import. Testing GGUF directly with realizar produces garbage output (LAYOUT-002 violation).

Correct workflow:

```bash
# 1. Convert GGUF -> APR (aprender transposes layout)
apr import model.gguf -o model.apr

# 2. Run qualification on APR
cargo run --bin apr-qa -- certify --model model.apr
```
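
To see why skipping the conversion yields garbage, consider what happens when a buffer written in column-major order is read as if it were row-major. The following is a generic illustration of a memory-layout mismatch using plain lists. It is not aprender's import code, and all helper names are hypothetical.

```python
def to_column_major(matrix):
    """Flatten a matrix column by column."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def read_row_major(flat, rows, cols):
    """Interpret a flat buffer as row-major: element (r, c) at r * cols + c."""
    return [[flat[r * cols + c] for c in range(cols)] for r in range(rows)]


def read_column_major(flat, rows, cols):
    """Interpret a flat buffer as column-major: element (r, c) at c * rows + r."""
    return [[flat[c * rows + r] for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    weights = [[1, 2, 3],
               [4, 5, 6]]
    flat = to_column_major(weights)
    print("buffer:           ", flat)
    print("wrong (row-major):", read_row_major(flat, 2, 3))
    print("right (col-major):", read_column_major(flat, 2, 3))
```

Reading with the wrong convention scrambles the weights without raising any error, which matches the garbage-output symptom: inference still runs, but the values land in the wrong positions.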

Contract Invariants (I-1 through I-5)

| Invariant | Name | Gate ID | What It Catches |
|-----------|------|---------|-----------------|
| I-1 | Round-trip Identity | F-CONTRACT-I1-001 | Inference divergence after conversion |
| I-2 | Tensor Name Bijection | F-CONTRACT-I2-001 | Missing/extra tensors in converted model |
| I-3 | No Silent Fallbacks | F-CONTRACT-I3-001 | Unknown dtype defaulting to F32 |
| I-4 | Statistical Preservation | F-CONTRACT-I4-001 | Tensor statistics drift beyond tolerance |
| I-5 | Tokenizer Roundtrip | F-CONTRACT-I5-001 | First-token mismatch between formats |
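
Two of these invariants are simple enough to sketch without model files. The example below shows the idea behind I-2 (name sets must match) and I-4 (summary statistics must not drift). It assumes tensor names are identical across formats and uses a made-up tolerance of 1e-3. The real checks live in `scripts/diagnose-conversion.sh` and may differ, and both function names are hypothetical.

```python
import statistics


def check_name_bijection(source_names, converted_names):
    """I-2 idea: report tensors missing from, or extra in, the converted model."""
    source, converted = set(source_names), set(converted_names)
    return {
        "missing": sorted(source - converted),
        "extra": sorted(converted - source),
    }


def check_statistics(source_values, converted_values, tolerance=1e-3):
    """I-4 idea: mean and standard deviation must agree within a tolerance.

    The tolerance default is a placeholder chosen for this sketch.
    """
    mean_drift = abs(statistics.fmean(source_values) - statistics.fmean(converted_values))
    std_drift = abs(statistics.pstdev(source_values) - statistics.pstdev(converted_values))
    return mean_drift <= tolerance and std_drift <= tolerance


if __name__ == "__main__":
    print(check_name_bijection(["embed", "lm_head"], ["embed", "lm_head.extra"]))
    print(check_statistics([0.1, -0.2, 0.3], [0.1, -0.2, 0.3]))
```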

Certification Status State Machine

```text
PENDING --[run tests]--> BLOCKED (any gateway fails OR MQS < 700)
PENDING --[run tests]--> PROVISIONAL (MQS >= 700, < 850)
PENDING --[run tests]--> CERTIFIED (MQS >= 850)
```

CSV Status Values: CERTIFIED, PROVISIONAL, BLOCKED, PENDING, PARTIAL, FAIL
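
The three transitions out of PENDING reduce to a small decision function. This is a minimal sketch of the state machine as drawn above, using the raw MQS thresholds it states. `certification_status` is a hypothetical name, and the PARTIAL and FAIL values are not modeled because the source does not define when they apply.

```python
def certification_status(mqs, gateways_passed, tests_run=True):
    """Map a raw MQS and gateway outcome to a certification status."""
    if not tests_run:
        return "PENDING"
    if not gateways_passed or mqs < 700:
        return "BLOCKED"
    if mqs < 850:
        return "PROVISIONAL"
    return "CERTIFIED"


if __name__ == "__main__":
    for mqs, gates in [(912, True), (780, True), (780, False), (650, True)]:
        print(mqs, gates, certification_status(mqs, gates))
```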

models.csv Schema (20 columns)

```text
model_id, family, parameters, size_category, status, mqs_score, grade,
certified_tier, last_certified, g1, g2, g3, g4, tps_gguf_cpu, tps_gguf_gpu,
tps_apr_cpu, tps_apr_gpu, tps_st_cpu, tps_st_gpu, provenance_verified
```

| Field | Type | Values |
|-------|------|--------|
| model_id | string | HuggingFace repo ID (e.g., Qwen/Qwen2.5-Coder-1.5B-Instruct) |
| size_category | enum | tiny, small, medium, large, xlarge, huge |
| status | enum | CERTIFIED, PROVISIONAL, BLOCKED, PENDING, PARTIAL, FAIL |
| grade | enum | A+, A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F |
| certified_tier | enum | smoke, quick, mvp, standard, deep, none |
| g1-g4 | bool | Gateway pass/fail |
| tps_* | float | Tokens/second by format and backend |
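
Before committing a hand-edited `models.csv`, a quick structural check can catch a shifted column or a misspelled enum value. The sketch below validates the header and three enum fields against the schema listed above. It is not part of apr-qa, `validate_csv` is a hypothetical helper, and the sample row uses placeholder values rather than real certification data.

```python
import csv
import io

# Column order copied from the models.csv schema.
COLUMNS = [
    "model_id", "family", "parameters", "size_category", "status",
    "mqs_score", "grade", "certified_tier", "last_certified",
    "g1", "g2", "g3", "g4",
    "tps_gguf_cpu", "tps_gguf_gpu", "tps_apr_cpu", "tps_apr_gpu",
    "tps_st_cpu", "tps_st_gpu", "provenance_verified",
]

ENUMS = {
    "size_category": {"tiny", "small", "medium", "large", "xlarge", "huge"},
    "status": {"CERTIFIED", "PROVISIONAL", "BLOCKED", "PENDING", "PARTIAL", "FAIL"},
    "certified_tier": {"smoke", "quick", "mvp", "standard", "deep", "none"},
}


def validate_csv(text):
    """Return a list of human-readable schema violations (empty when valid)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != COLUMNS:
        return ["header does not match the 20-column schema"]
    errors = []
    for line_no, row in enumerate(reader, start=2):
        for field, allowed in ENUMS.items():
            if row[field] not in allowed:
                errors.append(f"line {line_no}: {field}={row[field]!r} is not allowed")
    return errors


# Placeholder row for illustration only; unspecified fields are left empty.
_placeholder = {
    "model_id": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
    "family": "qwen-coder",
    "size_category": "small",
    "status": "PENDING",
    "certified_tier": "none",
}
SAMPLE_CSV = (
    ",".join(COLUMNS) + "\n"
    + ",".join(_placeholder.get(column, "") for column in COLUMNS) + "\n"
)

if __name__ == "__main__":
    print(validate_csv(SAMPLE_CSV))
```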

Common Pitfalls

1. Using GGUF Directly (LAYOUT-002)

GGUF is column-major. Running it directly through realizar produces garbage. Always convert first.

Symptom: G4 failure with garbage output such as `olumbia+lsi nunca/localENTS`

Fix: Convert GGUF -> APR via `apr import model.gguf -o model.apr`

2. Missing --model-cache

The certify command needs to find model files. Either:

  • Set --model-cache /path/to/models
  • Or ensure models are in the default cache location

3. Forgetting to Update README

After certification runs, ALWAYS sync the README table:

```bash
make update-certifications
```

4. Wrong Failure Policy for the Task

  • Debugging? Use --fail-fast for enhanced tracing
  • Certification run? Use --failure-policy collect-all to get complete picture
  • CI gate? Use --failure-policy stop-on-p0 (default)

Other CLI Subcommands

| Command | Purpose | Example |
|---------|---------|---------|
| generate | Generate test scenarios | `apr-qa generate Qwen/Qwen2.5-Coder-1.5B-Instruct -c 50` |
| score | Calculate MQS from evidence | `apr-qa score evidence.json` |
| report | Generate HTML/JUnit/Markdown | `apr-qa report evidence.json --formats all` |
| list | Query model registry | `apr-qa list --size small` |
| lock-playbooks | Generate integrity lock file | `apr-qa lock-playbooks playbooks/models/` |
| tickets | Auto-generate failure tickets | `apr-qa tickets evidence.json --repo paiml/aprender` |
| parity | HF golden corpus verification | `apr-qa parity --model-family qwen2.5-coder-1.5b` |
| export-csv | Export evidence to CSV | `apr-qa export-csv --evidence-dir evidence/` |
| export-evidence | Export structured evidence | `apr-qa export-evidence source.json --model Qwen/...` |
| validate-contract | Tensor layout contract check | `apr-qa validate-contract model.gguf` |
| tools | APR tool coverage tests | `apr-qa tools /path/to/model` |

See Also

Key Files

| File | Purpose |
|------|---------|
| playbooks/playbook.schema.yaml | Playbook JSON Schema |
| playbooks/evidence.schema.json | Evidence artifact schema |
| docs/certifications/models.csv | Certification registry (93+ models) |
| playbooks/templates/ | 5 reusable templates |
| playbooks/models/ | 120+ model-specific playbooks |
| scripts/diagnose-conversion.sh | 5-invariant conversion test |
| scripts/validate-schemas.sh | Schema validation |
| scripts/validate-aprender-alignment.sh | Cross-repo consistency |