SKILL: Assess Output Quality

Purpose

Score LLM output quality (0.0-1.0) against task requirements to determine if iteration is needed or solution is complete.

Description

Evaluates output against four criteria:

•Correctness: Is the logic sound? Does it work?
•Completeness: Are all requirements addressed?
•Clarity: Is it well documented and readable?
•Quality: Is it production-ready?

Returns structured assessment with score, strengths, gaps, and recommendation.

Usage

bash

/assess-quality --task "Create palindrome checker" --output output.md
/assess-quality --task "task" --output - # Read from stdin
/assess-quality --task "task" --threshold 0.85

Parameters

Parameter	Default	Description
`--task`	required	Original task description
`--output`	required	File path or `-` for stdin
`--threshold`	0.90	Quality bar for acceptance
`--verbose`	false	Show detailed reasoning

Output Format

xml

<quality_assessment>
<score>0.87</score>
<verdict>GOOD - Threshold Met (0.85)</verdict>

<criteria>
  <criterion name="Correctness" score="0.90">
    Logic is sound, handles edge cases correctly.
    Algorithm terminates and returns expected values.
  </criterion>
  <criterion name="Completeness" score="0.85">
    All requirements addressed.
    Minor: could add more test cases.
  </criterion>
  <criterion name="Clarity" score="0.88">
    Well documented with docstring.
    Variable names are descriptive.
  </criterion>
  <criterion name="Quality" score="0.85">
    Production ready with error handling.
    Type hints included.
  </criterion>
</criteria>

<strengths>
  <strength>Comprehensive error handling</strength>
  <strength>Clear documentation with examples</strength>
  <strength>Edge cases covered (empty, single char)</strength>
  <strength>Type validation included</strength>
  <strength>Two implementation approaches provided</strength>
</strengths>

<gaps>
  <gap priority="low">Could add performance benchmarks</gap>
  <gap priority="low">Unicode handling not explicit</gap>
</gaps>

<recommendation>
ACCEPT - Output meets quality threshold (0.87 >= 0.85).
No further iteration needed unless performance optimization required.
</recommendation>
</quality_assessment>

Scoring Rubric

Score	Level	Description
0.95-1.0	EXCELLENT	Production-ready, comprehensive, exemplary
0.85-0.94	GOOD	Meets requirements, minor improvements possible
0.70-0.84	ACCEPTABLE	Works but needs refinement
0.50-0.69	NEEDS WORK	Significant gaps, iterate required
< 0.50	POOR	Major issues, likely needs restart

Verdicts

•ACCEPT: Quality >= threshold, no iteration needed
•ITERATE: Quality < threshold, improvement recommended
•CLARIFY: Task ambiguous, need more requirements
•RESTART: Fundamental issues, new approach needed

How It Works

•Send to LLM with assessment rubric
•Score each criterion (0.0-1.0)
•Aggregate weighted average
•Identify strengths and gaps
•Compare to threshold
•Return verdict and reasoning

Assessment Prompt

markdown

Assess the quality of this output for the given task.

TASK:
{task}

OUTPUT:
{output}

Score each criterion 0.0-1.0:
1. Correctness: Is the logic sound?
2. Completeness: All requirements met?
3. Clarity: Well documented and readable?
4. Quality: Production ready?

Return overall score (0.0-1.0) and detailed reasoning.

Examples

Example 1: High Quality Output

bash

$ /assess-quality --task "Implement binary search" --output binary_search.py

xml

<quality_assessment>
<score>0.94</score>
<verdict>EXCELLENT - Exceeds threshold</verdict>

<criteria>
  <criterion name="Correctness" score="0.95">
    Algorithm correct, handles all cases including empty array.
  </criterion>
  <criterion name="Completeness" score="0.93">
    All requirements met, plus bonus insert-position function.
  </criterion>
  <criterion name="Clarity" score="0.95">
    Excellent docstring with examples, clear variable names.
  </criterion>
  <criterion name="Quality" score="0.92">
    Type hints, error handling, overflow-safe calculation.
  </criterion>
</criteria>

<strengths>
  <strength>Production-ready code</strength>
  <strength>Comprehensive documentation</strength>
  <strength>Edge case handling (empty, single, not found)</strength>
  <strength>Bonus utility function included</strength>
</strengths>

<gaps>
  <gap priority="very-low">Could add type stubs file</gap>
</gaps>

<recommendation>
ACCEPT - Excellent quality. Ready for production use.
</recommendation>
</quality_assessment>

Example 2: Needs Iteration

bash

$ /assess-quality --task "Create API rate limiter" --output rate_limiter_v1.py

xml

<quality_assessment>
<score>0.68</score>
<verdict>NEEDS WORK - Below threshold (0.90)</verdict>

<criteria>
  <criterion name="Correctness" score="0.75">
    Basic logic works but race condition possible in distributed scenario.
  </criterion>
  <criterion name="Completeness" score="0.60">
    Missing: distributed sync, failure handling, monitoring.
  </criterion>
  <criterion name="Clarity" score="0.72">
    Documentation present but lacks architecture explanation.
  </criterion>
  <criterion name="Quality" score="0.65">
    Not production ready - needs error handling, tests.
  </criterion>
</criteria>

<strengths>
  <strength>Token bucket algorithm implemented</strength>
  <strength>Basic rate limiting works for single node</strength>
</strengths>

<gaps>
  <gap priority="high">No distributed consistency mechanism</gap>
  <gap priority="high">Missing failure mode handling</gap>
  <gap priority="medium">No monitoring/metrics</gap>
  <gap priority="medium">Missing test suite</gap>
</gaps>

<recommendation>
ITERATE - Significant gaps remain. Focus next iteration on:
1. Add Redis-based distributed locking
2. Implement circuit breaker for failures
3. Add monitoring hooks
</recommendation>
</quality_assessment>

Example 3: Ambiguous Task

bash

$ /assess-quality --task "Make it better" --output code.py

xml

<quality_assessment>
<score>N/A</score>
<verdict>CLARIFY - Task requirements unclear</verdict>

<issues>
  <issue>Task "make it better" lacks specific requirements</issue>
  <issue>Cannot assess completeness without clear goals</issue>
  <issue>Quality criteria undefined</issue>
</issues>

<recommendation>
CLARIFY - Please specify:
- What aspects need improvement?
- What are the success criteria?
- What constraints apply?
</recommendation>
</quality_assessment>

When to Use

Use when:

•After each iteration to decide if done
•Validating output before delivery
•Justifying why solution is complete
•Identifying specific improvement areas
•Comparing multiple solution versions

Don't use when:

•Output is trivially correct (obvious)
•Task is exploratory (no clear criteria)
•Just want quick feedback (ask directly)

Integration

Chain with other skills:

bash

# Iterate then assess
$ /meta-prompt-iterate "task" && /assess-quality --output .prompts/*/FINAL.md

# Assess multiple iterations
$ for f in .prompts/*/output.md; do /assess-quality --output "$f"; done

# Use in decision loop
quality=$(/assess-quality --output output.md | grep score | cut -d'>' -f2 | cut -d'<' -f1)
if [ $(echo "$quality < 0.90" | bc) -eq 1 ]; then
  /meta-prompt-iterate "task" --context output.md
fi

# Compare versions
$ /assess-quality --output v1.py > v1-quality.xml
$ /assess-quality --output v2.py > v2-quality.xml
$ diff v1-quality.xml v2-quality.xml

Threshold Guidelines

Task Type	Suggested Threshold
Simple utilities	0.85
Production code	0.90
Critical systems	0.95
Documentation	0.85
Architecture designs	0.90

Implementation

Uses quality assessment from the meta-prompting engine:

python

from meta_prompting_engine.llm_clients.claude import ClaudeClient

llm = ClaudeClient(api_key="...")

# Assessment prompt
assessment_prompt = f"""
Assess the quality of this output for the task.

TASK: {task}
OUTPUT: {output}

Return a score 0.0-1.0 based on:
- Correctness (logic sound?)
- Completeness (requirements met?)
- Clarity (documented?)
- Quality (production ready?)
"""

response = llm.complete([
    {"role": "user", "content": assessment_prompt}
], max_tokens=10)

score = float(response.content)
print(f"Quality: {score}")

Source

•Engine: /meta_prompting_engine/core.py (quality assessment logic)
•Client: /meta_prompting_engine/llm_clients/claude.py
•Tests: /tests/test_core_engine.py
•Real results: /test_real_api.py output