Multi Review
Runs the code-review skill with 3 different models in parallel, then synthesizes with active validation.
Process
Phase 1: Gather Reviews
- •
Get the PR diff (same as code-review)
bash# If PR number provided, use it. Otherwise current branch. gh pr diff [PR_NUMBER] > /tmp/pr-diff.txt
- •
Run 3 parallel reviews via bash
bashpi -p --model claude-opus-4-5 "Read and follow ~/dev/pi-skills/code-review/SKILL.md to review the PR. Diff is at /tmp/pr-diff.txt" > /tmp/review-opus.md & pi -p --model gpt-5.2-codex "Read and follow ~/dev/pi-skills/code-review/SKILL.md to review the PR. Diff is at /tmp/pr-diff.txt" > /tmp/review-codex.md & pi -p --model gemini-2.5-pro "Read and follow ~/dev/pi-skills/code-review/SKILL.md to review the PR. Diff is at /tmp/pr-diff.txt" > /tmp/review-gemini.md & wait
Phase 2: Active Validation (IMPORTANT)
Do not blindly trust the reviewers. Validate each finding yourself.
- •
Read PR context first Before looking at sub-agent reviews, get the full picture:
bash# What the PR claims to do gh pr view [PR_NUMBER] --json title,body # What it actually does cat /tmp/pr-diff.txt # What others have already said gh pr view [PR_NUMBER] --json comments,reviews --jq '.comments[].body, .reviews[].body'
Form your own impressions. Note any issues already flagged in PR feedback.
- •
Collect all findings Build a deduplicated list of every issue from all 3 reviews. Note which model(s) found each issue.
- •
Validate EACH finding For every finding, actually look at the code and verify:
- •Is this a real bug/issue? (check the code, don't just trust the claim)
- •Is it a false positive? (model hallucinated or misunderstood)
- •What file/line is affected? (verify it exists and matches)
- •
Score by IMPACT, not consensus Rate each validated issue by actual severity:
- •🔴 Critical: Breaks functionality, security issue, data loss
- •🟠 High: Real bugs, incorrect behavior, major guideline violations
- •🟡 Medium: Performance, maintainability, edge cases
- •🟢 Low: Style, minor improvements, nitpicks
Consensus count (2+ models) ≠ importance.
- •Consensus often means "obvious issue any reviewer would catch"
- •Unique findings may be subtle insights worth MORE attention, not less
- •
Flag unique findings for extra scrutiny When only one model found something:
- •WHY did only one catch it? (deeper insight vs hallucination?)
- •Validate more carefully - could be the most important find
- •Could also be a false positive - verify against actual code
- •
Check for gaps What might ALL models have missed?
- •Complex state/timing issues (e.g., async race conditions)
- •Claimed features that don't actually work (check PR description)
- •Subtle logic errors in control flow
- •Look at the PR description - are all claims implemented?
Phase 3: Synthesized Output
- •Output format
# 🔍 Multi-Model PR Review: [PR title] ## Validated Issues ### 🔴 Critical [Issues that must be fixed - functionality broken, security, etc.] ### 🟠 High Priority [Real bugs, incorrect behavior - should fix before merge] ### 🟡 Medium Priority [Performance, maintainability, edge cases - should discuss] ### 🟢 Low Priority [Style, minor improvements - nice to have] Each issue should include: - **File**: path/to/file.ext#L10-L15 - **Status**: ✅ Confirmed | ⚠️ Needs verification | ❌ False positive - **Found by**: Opus / Codex / Gemini / PR feedback - **Description**: What's wrong and why it matters - **Suggestion**: How to fix (if applicable) ## ❌ False Positives Filtered [List any findings that were wrong, with brief explanation] ## ⚠️ Potential Gaps [Things all models may have missed - especially check PR description claims] ## 📊 Model Coverage | Issue | Opus | Codex | Gemini | PR | Status | | ------- | :--: | :---: | :----: | :-: | ----------------- | | Issue 1 | ✅ | ✅ | ❌ | - | ✅ Confirmed | | Issue 2 | ❌ | ✅ | ❌ | - | ✅ Confirmed | | Issue 3 | ✅ | ❌ | ✅ | - | ❌ False positive | | Issue 4 | ❌ | ❌ | ❌ | ✅ | ⚠️ Models missed! | ## Final Verdict **[MERGE / FIX FIRST / NEEDS DISCUSSION]** [Brief explanation of verdict]
Key Principles
- •Validate, don't just synthesize - You are the senior reviewer, not a secretary
- •Unique findings deserve MORE attention - They might be the deepest insights
- •Consensus ≠ importance - Obvious issues get caught by all; critical bugs may be subtle
- •Check what's missing - The worst bugs are the ones no one found
- •Compare against PR description - Do claimed features actually work?