Agentic Evaluation Framework (AEF)
A modular quality-control architecture for AI systems.
This framework transforms one-shot generation into a controlled evaluation lifecycle:
Generate → Evaluate → Adversarial Review → Optimize → Confidence Check → Converge → Output
Core Architecture
1. Generator
Responsible only for producing candidate outputs.
def generate(task: str) -> str:
    return llm(f"Complete the task:\n{task}")
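The snippets in this document call an `llm(prompt)` helper that is never defined here. A minimal sketch of that contract, assuming a plain string-in, string-out callable; `make_llm` and `stub_backend` are hypothetical names, and the backend is a canned stand-in rather than a real model client:

```python
from typing import Callable

def make_llm(backend: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap any prompt -> text backend so callers always get a stripped string."""
    def llm(prompt: str) -> str:
        if not prompt.strip():
            raise ValueError("prompt must not be empty")
        return backend(prompt).strip()
    return llm

# Hypothetical stand-in backend for local testing; swap in a real client.
def stub_backend(prompt: str) -> str:
    return f"[stub completion for a {len(prompt)}-character prompt]\n"

llm = make_llm(stub_backend)

def generate(task: str) -> str:
    return llm(f"Complete the task:\n{task}")
```

Keeping the model behind a single callable makes it straightforward to test the rest of the pipeline without network access.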
2. Evaluator
Scores output using structured criteria.
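import json

def evaluate(task: str, output: str) -> dict:
    """Score `output` against `task` and return the evaluator's parsed JSON."""
    # `llm` is assumed to be a prompt -> text helper defined elsewhere.
    # json.loads raises json.JSONDecodeError if the reply is not pure JSON.
    return json.loads(llm(f"""
Evaluate the output for the given task.
Task: {task}
Output: {output}
Return only a JSON object in this shape, with every score a number from 0 to 1:
{{
  "overall_score": <0-1>,
  "dimensions": {{
    "accuracy": <0-1>,
    "clarity": <0-1>,
    "completeness": <0-1>
  }},
  "confidence": <0-1>,
  "feedback": "actionable critique"
}}
"""))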
3. Adversarial Reviewer (Optional but Recommended)
For robustness and edge-case detection.
def adversarial_review(task: str, output: str) -> str:
    return llm(f"""
You are a critical reviewer.
Find hidden flaws, missing assumptions,
edge cases, logical gaps, or failure risks.
Task: {task}
Output: {output}
""")
4. Optimizer
Refines output based on structured feedback.
def optimize(task: str, output: str, feedback: str) -> str:
    return llm(f"""
Improve the output based on the following critique.
Task: {task}
Feedback: {feedback}
Original Output: {output}
""")
5. Controller (Refinement Loop)
Coordinates lifecycle and convergence logic.
def run_agent(task: str, max_iterations: int = 4, threshold: float = 0.85) -> dict:
    history = []
    output = generate(task)
    previous_score = None  # no score yet, so the plateau check skips iteration 0
    for i in range(max_iterations):
        evaluation = evaluate(task, output)
        score = evaluation["overall_score"]
        history.append({
            "iteration": i,
            "output": output,
            "score": score,
            "confidence": evaluation["confidence"],
            "feedback": evaluation["feedback"],
        })
        # Success threshold: stop once score and confidence are both high enough
        if score >= threshold and evaluation["confidence"] >= 0.8:
            break
        # Convergence detection (plateau): score barely moved since last iteration
        if previous_score is not None and abs(score - previous_score) < 0.01:
            break
        # Out of budget: stop here so final_output is always a scored output
        if i == max_iterations - 1:
            break
        previous_score = score
        output = optimize(task, output, evaluation["feedback"])
    return {
        "final_output": output,
        "history": history,
    }
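Refinement does not always improve the score, so a controller may prefer to return the best output it has seen rather than the most recent one. A sketch of that variant, assuming the three stages are passed in as callables with the same signatures as above; `run_agent_keep_best` is a hypothetical name and the confidence check is omitted for brevity:

```python
from typing import Callable

def run_agent_keep_best(
    task: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
    optimize: Callable[[str, str, str], str],
    max_iterations: int = 4,
    threshold: float = 0.85,
) -> dict:
    """Refinement loop that returns the highest-scoring evaluated output."""
    output = generate(task)
    best_output, best_score = output, float("-inf")
    history = []
    for i in range(max_iterations):
        evaluation = evaluate(task, output)
        score = evaluation["overall_score"]
        history.append({"iteration": i, "output": output, "score": score})
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:
            break
        if i < max_iterations - 1:  # never produce an output that goes unscored
            output = optimize(task, output, evaluation["feedback"])
    return {"final_output": best_output, "best_score": best_score, "history": history}
```

Passing the stages as arguments also lets the loop be exercised with stub functions in unit tests.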
Advanced Patterns
Multi-Judge Consensus
Reduces evaluator bias and increases reliability.
import statistics

def ensemble_score(task: str, output: str, n: int = 3) -> dict:
    scores = []
    for _ in range(n):
        scores.append(evaluate(task, output)["overall_score"])
    return {
        "mean": sum(scores) / len(scores),
        # Sample variance needs at least two scores
        "variance": statistics.variance(scores) if len(scores) > 1 else 0,
    }
Interpretation:
- Low variance → stable evaluation
- High variance → unreliable scoring, consider adversarial review
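The interpretation above can be turned into a simple check. A minimal sketch, assuming a variance cutoff of 0.01; both the cutoff and the name `needs_adversarial_review` are illustrative choices, not values defined by the framework:

```python
import statistics

def needs_adversarial_review(scores: list[float], max_variance: float = 0.01) -> bool:
    """Return True when judge scores disagree too much to be trusted."""
    if len(scores) < 2:
        # A single score says nothing about spread, so treat it as unreliable.
        return True
    return statistics.variance(scores) > max_variance
```

The cutoff should be tuned against real evaluator runs, since score spread depends heavily on the model and the rubric.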
Rubric-Based Evaluation
RUBRIC = {
    "accuracy": 0.4,
    "clarity": 0.3,
    "completeness": 0.3,
}
The weighted score is the sum of each dimension score multiplied by its weight. Because the weights above sum to 1.0, the result stays in the 0–1 range.
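That calculation can be sketched as follows. The helper name `weighted_score` is an assumption, and the division by the total weight is an added safeguard so that rubrics whose weights do not sum to 1.0 still yield a 0–1 score:

```python
RUBRIC = {"accuracy": 0.4, "clarity": 0.3, "completeness": 0.3}

def weighted_score(dimensions: dict[str, float], rubric: dict[str, float] = RUBRIC) -> float:
    """Combine per-dimension scores into one score using the rubric weights."""
    missing = rubric.keys() - dimensions.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(rubric.values())
    weighted_sum = sum(dimensions[name] * weight for name, weight in rubric.items())
    return weighted_sum / total_weight
```

For example, dimension scores of 1.0, 0.5, and 0.0 for accuracy, clarity, and completeness give 0.4 + 0.15 + 0.0 = 0.55.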
Tournament Generation (Alternative to Iteration)
Generate multiple candidates and select the best.
def tournament(task: str, k: int = 3) -> str:
    candidates = [generate(task) for _ in range(k)]
    best = candidates[0]
    for candidate in candidates[1:]:
        decision = llm(f"""
Compare Output A and Output B.
Which better completes the task?
Task: {task}
Output A: {best}
Output B: {candidate}
Respond with exactly one letter: A or B
""")
        # Match only a leading "B"; a substring test would also fire on
        # replies such as "A, because B is incomplete".
        if decision.strip().upper().startswith("B"):
            best = candidate
    return best
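Pairwise judges can favor whichever output appears in a particular slot. One possible guard, sketched here under the assumption that the judge is a prompt-in, letter-out callable, is to ask twice with the order swapped and replace the incumbent only when the challenger wins both times; `pick_winner` and `judge` are hypothetical names:

```python
from typing import Callable

def pick_winner(task: str, incumbent: str, challenger: str,
                judge: Callable[[str], str]) -> str:
    """Return the challenger only if it wins in both presentation orders."""
    def ask(first: str, second: str) -> str:
        reply = judge(
            f"Task: {task}\nOutput A: {first}\nOutput B: {second}\n"
            "Respond with exactly one letter: A or B"
        )
        return reply.strip().upper()[:1]

    wins_as_b = ask(incumbent, challenger) == "B"
    wins_as_a = ask(challenger, incumbent) == "A"
    return challenger if (wins_as_b and wins_as_a) else incumbent
```

This doubles the number of judge calls per comparison, so it trades cost for robustness.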
Benchmarking Mode (Project-Wide Quality Control)
Use dataset-driven regression testing.
def benchmark(agent, test_set):
    scores = []
    for example in test_set:
        result = agent(example["task"])
        evaluation = evaluate(example["task"], result["final_output"])
        scores.append(evaluation["overall_score"])
    return sum(scores) / len(scores) if scores else 0
Use it to:
- Detect regressions
- Compare model versions
- Track improvement over time
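A mean score can hide individual examples that got worse. A small sketch of per-example regression detection, assuming scores are stored in dictionaries keyed by an example ID; the ID scheme, the name `find_regressions`, and the 0.05 tolerance are all illustrative assumptions:

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Return IDs of examples whose score dropped by more than `tolerance`."""
    return sorted(
        example_id
        for example_id, old_score in baseline.items()
        if example_id in current and old_score - current[example_id] > tolerance
    )
```

Examples present in only one of the two runs are ignored here; a stricter check might flag them as well.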
Failure Modes & Mitigations
| Failure Mode | Mitigation |
|---|---|
| Self-affirming bias | Use separate evaluator model |
| Score inflation drift | Benchmark against fixed dataset |
| Reward hacking | Randomize rubric phrasing |
| Mode collapse | Add diversity sampling |
| Evaluator hallucination | Require justification text |
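The last row of the table can be enforced mechanically before a score is trusted. A rough sketch, assuming the evaluator replies with the JSON shape shown in the Evaluator section; `validate_evaluation` and the 20-character minimum are hypothetical choices:

```python
import json
import re

def validate_evaluation(raw: str, min_feedback_chars: int = 20) -> dict:
    """Parse an evaluator reply and reject scores that lack a justification."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # first "{" through last "}"
    if match is None:
        raise ValueError("no JSON object found in evaluator reply")
    data = json.loads(match.group(0))
    for key in ("overall_score", "confidence"):
        value = float(data[key])  # KeyError if the field is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key}={value} is outside [0, 1]")
    feedback = str(data.get("feedback", "")).strip()
    if len(feedback) < min_feedback_chars:
        raise ValueError("evaluation lacks a substantive justification")
    return data
```

A length check is only a crude proxy for a real justification, but it catches replies that return a bare score.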
Cost-Aware Routing
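Each refinement pass adds evaluator and optimizer calls, so reflection can raise token cost by roughly 2–5× compared with one-shot generation.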
Optimization strategies:
- Skip evaluation for trivial outputs
- Use smaller model as evaluator
- Early-stop when high confidence
- Cache repeated evaluations
Example:
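# Inside run_agent, right after output = generate(task):
if len(output) < 50:  # trivial output, skip the evaluation loop
    return {"final_output": output, "history": []}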
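Caching repeated evaluations can be sketched as a wrapper around the evaluator. This assumes the evaluator takes `(task, output)` and that reusing a previous result for identical inputs is acceptable; `cached_evaluator` is a hypothetical name:

```python
import hashlib
from typing import Callable

def cached_evaluator(evaluate: Callable[[str, str], dict]) -> Callable[[str, str], dict]:
    """Wrap an evaluator so identical (task, output) pairs are scored once."""
    cache: dict[str, dict] = {}

    def wrapper(task: str, output: str) -> dict:
        # Hash the pair so long outputs do not become long dictionary keys.
        key = hashlib.sha256(f"{task}\x00{output}".encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = evaluate(task, output)
        return cache[key]

    return wrapper
```

The cache should be bypassed for multi-judge consensus, where the point is to draw several independent scores for the same output.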
Confidence-Based Routing Logic
Recommended decision flow:
- High score + high confidence → Accept
- High score + low confidence → Adversarial review
- Low score + high variance → Regenerate
- Low score + low confidence → Optimize
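The decision flow above could be expressed as a single routing function. In this sketch the score and confidence cutoffs reuse the 0.85 and 0.8 values from the controller, while the variance cutoff of 0.01 and the function name `route` are assumptions:

```python
def route(
    score: float,
    confidence: float,
    variance: float,
    score_threshold: float = 0.85,
    confidence_threshold: float = 0.8,
    variance_threshold: float = 0.01,
) -> str:
    """Map an evaluation result to the next action in the lifecycle."""
    if score >= score_threshold:
        if confidence >= confidence_threshold:
            return "accept"
        return "adversarial_review"
    if variance > variance_threshold:
        # Judges disagree on a weak output: start over instead of patching it.
        return "regenerate"
    return "optimize"
```

The four listed cases do not say what to do with a low score that has high confidence and low variance; this sketch falls back to `"optimize"` for that case.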
Logging & Trace Structure
Always store the full evaluation trajectory for observability.
trajectory = {
    "task": task,
    "iterations": [
        {
            "output": "...",
            "score": 0.82,
            "confidence": 0.74,
            "feedback": "..."
        }
    ]
}
Enables:
- Debugging
- Drift detection
- Failure clustering
- Evaluator monitoring
- Auditability
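One simple way to persist trajectories for these uses is an append-only JSON Lines file, one trajectory per line. A minimal sketch, assuming local file storage is acceptable; `append_trajectory` and `load_trajectories` are hypothetical helpers:

```python
import json
from pathlib import Path

def append_trajectory(path: str, trajectory: dict) -> None:
    """Append one trajectory as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(trajectory, ensure_ascii=False) + "\n")

def load_trajectories(path: str) -> list[dict]:
    """Read every stored trajectory back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

An append-only log keeps earlier runs intact, which suits auditing and lets drift analysis compare runs over time.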
Maturity Levels
Level 1 — Basic Reflection
Level 2 — Evaluator Separation
Level 3 — Adversarial & Ensemble
Level 4 — Benchmark-Driven
Level 5 — Confidence-Calibrated & Cost-Aware
When To Use This Skill
Use for:
- Code generation
- Reports
- Business analysis
- Research synthesis
- Data interpretation
- Any quality-critical output
Avoid for:
- Casual chat
- Ultra-low-latency use cases
- Low-stakes responses
Design Philosophy
This framework treats generation as stochastic and evaluation as control.
It assumes:
- Outputs can improve
- Scores can be quantified
- Quality can be systematized
- Evaluation must be auditable
The goal is measurable, iterative improvement — not perfection.