/skill-test - Interactive Skill Testing

Test skills by executing generated code and building verified ground truth datasets.

Usage

code

/skill-test <skill-name>

Read the target skill's SKILL.md and key references:

code

.claude/skills/<skill-name>/SKILL.md
.claude/skills/<skill-name>/references/  (if exists)

Ask the user for a test prompt. Example prompts for mlflow-evaluation:

Respond using the loaded skill's knowledge. Include complete, runnable Python code.

Extract Python code blocks from the response. For each block:

•

Write to temp file:

bash

cat > /tmp/skill_test_block_N.py << 'EOF'
<code>
EOF

•

Execute with timeout:

bash

timeout 30 python /tmp/skill_test_block_N.py

•
Report result:
- •Success: "Block N: Executed successfully"
- •Failure: "Block N: Failed - <error>"

If ALL code blocks execute successfully, ask:

"All code executed successfully. Save as ground truth? [Y/n]"

If yes, append to benchmarks/skills/<skill-name>/ground_truth.yaml. See ground-truth-format.md for the YAML schema.

If any code block fails, show the error and ask:

"Code execution failed. [R]etry with fix, [S]kip, or [A]bort?"

•Create benchmarks/skills/<skill-name>/ directory if it doesn't exist
•Auto-increment ground truth IDs (gt-001, gt-002, etc.)
•Tag examples with inferred categories (e.g., "evaluation", "scorer", "dataset")