LLM Judge
Compare code implementations across multiple repositories using structured evaluation.
Usage
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
Arguments
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |
Workflow
- Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
- Validate the spec file, each repo path, and the minimum repo count.
- Read the spec document into memory.
- Load this skill and the supporting reference files.
- Spawn one Phase 1 repo agent per repository to gather facts only.
- Validate the repo-agent JSON results before proceeding.
- Spawn one Phase 2 judge agent per dimension.
- Aggregate scores, compute weighted totals, rank repos, and write the report.
- Display the markdown summary and verify the JSON report.
Command Workflow
Step 1: Parse Arguments
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main
Default Weights:
{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
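The parsing rules above can be sketched in Python as follows. This is a minimal illustration, not the command's actual implementation: the function names `parse_arguments` and `parse_weights` are hypothetical, and the sketch assumes that a `--weights` override replaces only the named dimensions and leaves the rest at their defaults. The source does not say whether overridden weights must still sum to 100, so no renormalization is attempted here.

```python
import argparse
import os

# Default weights copied from the JSON above.
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(raw):
    """Merge a 'name:value,name:value' override string into the defaults."""
    weights = dict(DEFAULT_WEIGHTS)
    if not raw:
        return weights
    for pair in raw.split(","):
        name, _, value = pair.partition(":")
        name = name.strip()
        if name not in DEFAULT_WEIGHTS:
            raise ValueError(f"Unknown dimension: {name!r}")
        weights[name] = int(value)
    return weights


def parse_arguments(argv):
    """Parse the command's argument list into a plain dict (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="llm-judge")
    parser.add_argument("spec")
    parser.add_argument("repos", nargs="+")
    parser.add_argument("--labels")
    parser.add_argument("--weights")
    parser.add_argument("--branch", default="main")
    args = parser.parse_args(argv)

    if len(args.repos) < 2:
        parser.error("Need at least 2 repositories to compare")

    if args.labels:
        labels = [label.strip() for label in args.labels.split(",")]
        if len(labels) != len(args.repos):
            parser.error("--labels must name every repository")
    else:
        # Default labels are the repository directory names.
        labels = [os.path.basename(os.path.normpath(p)) for p in args.repos]

    return {
        "spec_path": args.spec,
        "repo_paths": args.repos,
        "labels": labels,
        "weights": parse_weights(args.weights),
        "branch": args.branch,
    }
```

Note that the override from the Arguments table, functionality:40,security:30, leaves the other three defaults untouched under this sketch, so the weights would then total 115 rather than 100.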
Step 2: Validate Inputs
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH" >&2; exit 1; }
for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo" >&2; exit 1; }
done
[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare" >&2; exit 1; }
Step 3: Read Spec Document
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH" >&2; exit 1; }
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH" >&2; exit 1; }
Step 4: Load the Skill
Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")
Step 5: Phase 1 - Spawn Repo Agents
Spawn one Task per repo:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $LABEL at $REPO_PATH
**Spec Document:** $SPEC_CONTENT

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis

Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.
Do NOT score or judge. Only gather facts.
Collect all repo outputs into ALL_FACTS.
Step 6: Validate Phase 1 Results
printf '%s' "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL" >&2; exit 1; }
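When there are several repos, the same check can be applied to every agent output while building ALL_FACTS. The sketch below assumes the raw agent outputs are available as a dict keyed by repo label. That shape and the helper name `collect_facts` are assumptions for illustration. It checks only that each output parses as a JSON object and does not validate against references/fact-schema.md.

```python
import json


def collect_facts(raw_outputs):
    """Parse each repo agent's raw output; fail fast on the first bad one.

    raw_outputs maps a repo label to the raw string its agent returned
    (an assumed shape, not something the skill prescribes).
    """
    all_facts = {}
    for label, raw in raw_outputs.items():
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON from {label}: {exc}") from exc
        if not isinstance(facts, dict):
            raise ValueError(f"Expected a JSON object from {label}")
        all_facts[label] = facts
    return all_facts
```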
Step 7: Phase 2 - Spawn Judge Agents
Spawn five judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:** $SPEC_CONTENT
**Facts from all repos:** $ALL_FACTS_JSON

**Instructions:**
1. Load skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
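Filling in the judge prompt once per dimension might look like the sketch below. It relies on the fact that the `$NAME` placeholders in the prompt happen to match the syntax of Python's `string.Template`. The dimension keys are taken from the default weights, the prompt text is abbreviated (the instruction list is omitted), and `build_judge_prompts` is a hypothetical helper. How each prompt is then handed to a judge agent is outside this sketch.

```python
import json
from string import Template

# Dimension keys taken from the default weights.
DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]

# Abbreviated version of the judge prompt; the instruction list is omitted.
JUDGE_PROMPT = Template(
    "You are the $DIMENSION Judge for the LLM Judge evaluation.\n"
    "**Spec Document:** $SPEC_CONTENT\n"
    "**Facts from all repos:** $ALL_FACTS_JSON\n"
    "Score each repo on $DIMENSION. "
    "Return ONLY valid JSON with scores and justifications."
)


def build_judge_prompts(spec_content, all_facts):
    """Return one filled-in prompt per dimension, keyed by dimension name."""
    facts_json = json.dumps(all_facts, indent=2)
    return {
        dimension: JUDGE_PROMPT.substitute(
            DIMENSION=dimension,
            SPEC_CONTENT=spec_content,
            ALL_FACTS_JSON=facts_json,
        )
        for dimension in DIMENSIONS
    }
```

Every judge receives the same spec and the same facts from all repos, so only the dimension name differs between the five prompts.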
Step 8: Aggregate Scores
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]
    # Weights are percentages, hence the division by 100.
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)
ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)
Step 9: Generate Verdict
Name the winner, explain why they won, and note any close calls or trade-offs.
Step 10: Write JSON Report
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
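A minimal sketch of writing the report is shown below. The field list (version, timestamp, repo metadata, weights, scores, ranking, verdict) comes from the step above, but the exact key names, the version string, and the `write_report` helper are assumptions made for illustration.

```python
import json
import os
from datetime import datetime, timezone


def write_report(labels, repo_paths, weights, scores, ranking, verdict,
                 report_dir=".beagle"):
    """Write the report as JSON and return the path that was written."""
    os.makedirs(report_dir, exist_ok=True)
    report = {
        "version": "1.0",  # hypothetical version string
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": [
            {"label": label, "path": path}
            for label, path in zip(labels, repo_paths)
        ],
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    path = os.path.join(report_dir, "llm-judge-report.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return path
```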
Step 11: Display Summary
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
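One possible way to render the scores table, ranking, and verdict is sketched below. It assumes `scores` has the shape produced in Step 8 (a per-dimension `score` plus a `weighted_total` for each label). The layout and the `render_summary` helper are illustrative choices, and the detailed justifications are left out for brevity.

```python
def render_summary(scores, ranking, dimensions, verdict):
    """Render a markdown scores table, ranking, and verdict as one string."""
    header = "| Repo | " + " | ".join(dimensions) + " | Weighted Total |"
    # One divider cell for the repo column, one per dimension, one for the total.
    divider = "|---|" + "---|" * (len(dimensions) + 1)
    lines = ["## Scores", "", header, divider]
    for label in ranking:
        cells = [str(scores[label][dim]["score"]) for dim in dimensions]
        total = f"{scores[label]['weighted_total']:.2f}"
        lines.append(f"| {label} | " + " | ".join(cells) + f" | {total} |")
    lines += ["", "## Ranking", ""]
    lines += [f"{rank}. {label}" for rank, label in enumerate(ranking, start=1)]
    lines += ["", "## Verdict", "", verdict]
    return "\n".join(lines)
```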
Step 12: Verification
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
Output Shape
The generated report should include:
- repo labels and paths
- per-dimension scores and justifications
- weighted totals and ranking
- a verdict explaining the winner
Reference Files
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Model
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
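As a worked example of how the weights and the 1-5 scale combine, the sketch below computes a weighted total for one repo. The per-dimension scores are invented purely for illustration, and `weighted_total` is a hypothetical helper that mirrors the aggregation formula (score × weight / 100, summed over dimensions).

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}

# Made-up scores for a single repo, one per dimension.
example_scores = {
    "functionality": 4,
    "security": 3,
    "tests": 5,
    "overengineering": 2,
    "dead_code": 4,
}


def weighted_total(dimension_scores, weights):
    """Sum score * weight / 100 over all dimensions, rounded to 2 places."""
    total = sum(dimension_scores[dim] * weights[dim] / 100 for dim in weights)
    return round(total, 2)


# 1.2 + 0.75 + 1.0 + 0.3 + 0.4
print(weighted_total(example_scores, DEFAULT_WEIGHTS))  # → 3.65
```

Because the default weights sum to 100, the weighted total stays on the same 1-5 scale as the individual scores: a repo scoring 5 everywhere totals 5.0, and one scoring 1 everywhere totals 1.0.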
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:** $SPEC_CONTENT

**Instructions:**
Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:** $SPEC_CONTENT
**Facts from all repos:** $ALL_FACTS_JSON

**Instructions:**
Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
Aggregation
- Collect the five judge outputs.
- Compute each repo's weighted total with the configured weights.
- Rank repos by weighted total in descending order.
- Generate a verdict that explains the result and any close calls.
- Write .beagle/llm-judge-report.json.
Output
Display a markdown summary with scores, ranking, verdict, and detailed justifications.
Verification
Before completing:
- Verify .beagle/llm-judge-report.json exists and is valid JSON.
- Verify all repos have scores for all dimensions.
- Verify each weighted total equals the weighted sum of that repo's dimension scores.
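These checks could be automated roughly as follows. The sketch assumes the report stores `weights` and `scores` under those keys, with `scores` shaped as in the aggregation step. The key names and the `verify_report` helper are assumptions, not a documented report schema.

```python
import json
import math


def verify_report(path, dimensions):
    """Return a list of problems found in the report; empty means it passed."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)  # raises if the file is missing or invalid

    problems = []
    weights = report["weights"]
    for label, repo_scores in report["scores"].items():
        missing = [dim for dim in dimensions if dim not in repo_scores]
        if missing:
            problems.append(f"{label}: missing dimensions {missing}")
            continue
        expected = sum(
            repo_scores[dim]["score"] * weights[dim] / 100 for dim in dimensions
        )
        # Stored totals are rounded to two decimals, so allow a small tolerance.
        if not math.isclose(expected, repo_scores["weighted_total"], abs_tol=0.01):
            problems.append(f"{label}: weighted_total should be {expected:.2f}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the caller report every failing repo in a single pass.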
Rules
- Always validate inputs before proceeding
- Spawn Phase 1 agents in parallel, then wait before Phase 2
- Spawn Phase 2 agents in parallel, one per dimension
- Every score must have a justification
- Write the JSON report before displaying the summary