AgentSkillsCN

loop-codex-review

通过渐进式的推理层级,自动执行代码审查循环。工具会并行运行 n 次 Codex 审查(可通过 -n 参数灵活配置),Claude 会对每次审查发现的问题逐一回应,逐步提升推理深度,从低级推理逐步深入至高阶推理,直至问题彻底解决(所有 n 次审查均无遗漏)。每一轮审查结束后,还需经过人工审核确认。

SKILL.md
--- frontmatter
name: loop-codex-review
description: Automated code review loop with progressive reasoning levels. Runs n parallel Codex reviews (configurable via -n), Claude addresses findings, climbs from low→xhigh reasoning until fixed point (all n clean). Human approval at each iteration.

Loop: Codex Review

You are a code review coordinator. Codex reviews, Claude addresses. Diverse LLM perspectives.

Core Philosophy

Every finding demands code improvement. No exceptions.

When a reviewer flags something, the code changes. Always. Either:

  • Real bug → fix the code
  • False positive → the code was unclear; add comments or refactor until the intent is obvious
  • Design tradeoff → document the rationale in code comments

There is no "dismiss," no "accept risk," no "wontfix." If a reviewer misunderstood, that's a signal the code isn't self-evident — a tired human would misunderstand too. The code must become clearer.

Fixed point = no reviewer can find anything to flag. Not because you argued them down, but because the code is both correct AND self-evident.

This loop creates a proof: when n independent reviews at each reasoning level (low through xhigh) find nothing to flag, you have strong evidence your code is unambiguous.


Core Concept

code
┌─────────────────┐     ┌───────────────────┐
│  codex review   │────▶│  Claude addresses │
│   (OpenAI CLI)  │     │   (Task agents)   │
└─────────────────┘     └───────────────────┘
         │                       │
         └───────── loop ────────┘
  • Review: Run codex review command via Bash — this is OpenAI's Codex doing analysis
  • Address: Spawn Claude Task agents to address findings (fix code OR clarify with comments/refactoring)
  • Value: Two different frontier LLMs catch different things

Relationship to loop-address-pr-feedback

Aspectloop-codex-reviewloop-address-pr-feedback
WhenPre-PR (local)Post-PR (remote)
Reviewercodex review CLIGitHub bots + humans
TriggerYou run itReviews arrive async
Interfacestdout parsingGitHub API
ScopeSingle diffStack of PRs
Fixed pointAll n reviews cleanAll threads resolved

Use this skill to validate code before opening a PR. Use loop-address-pr-feedback to address reviewer comments after.

Reasoning Levels

Codex supports different reasoning effort levels. Always set explicitly.

code
┌─────────┬────────────────────────────────────────────┬──────────┐
│  Level  │  Description                               │  Time    │
├─────────┼────────────────────────────────────────────┼──────────┤
│  low    │  Quick scan - fast iteration, obvious bugs │   ~3m    │
│  medium │  Moderate depth - good balance             │   ~5m    │
│  high   │  Deep analysis - catches subtle issues     │  ~8-10m  │
│  xhigh  │  Exhaustive - maximum thoroughness         │ ~12-20m  │
└─────────┴────────────────────────────────────────────┴──────────┘

Command syntax:

bash
codex review --base master -c model_reasoning_effort="high"

⚠️ Lower Reasoning Caveat: Reviews at low/medium are faster but may miss subtle bugs. Real example: low and medium both returned clean (all n reviews clean at each level), but high found a case-sensitivity bug (uppercase hex not normalized). Always climb to at least high for production code.

Progressive Strategy (Default)

Default behavior: Climb the reasoning ladder from low → xhigh

code
low (all n clean) → medium (all n clean) → high (all n clean) → xhigh (all n clean) → DONE
         ↑                    ↑                    ↑                     ↑
         └──────── finding? address and restart at this level ──────────┘

Where n is the -n parameter (default: 3). Run n reviews in parallel at each level. If ALL n are clean → advance. If ANY has findings → address and re-run. Higher n = more parallel reviewers = higher confidence.

Note: "Findings" includes both real bugs AND false positives. False positives mean the code is unclear — add comments or refactor until the intent is obvious. See "Verification of Findings" section.

Why progressive?

  • Fast feedback at low levels catches obvious issues quickly
  • Each level validates the previous (higher levels catch what lower missed)
  • User can stop early ("good enough, let's PR") but continuing is automatic
  • Restarting a stopped loop is annoying; stopping a running one is easy

Workflow Overview

code
1. Initialize       → Accept target (--base branch or --uncommitted)
2. Run codex review → Launch n parallel reviews via Bash (run_in_background: true)
3. Parse Output     → Extract findings into tracker
4. Evaluate         → ALL clean? → step 5. Else (findings exist) → step 6.
5. Check Exit       → At xhigh? → Done. Else → advance level, go to step 2.
6. Address Findings → Claude agents address findings (parallel)
7. Verify           → Tests pass, files modified
8. Human Approval   → Present summary, get explicit approval, commit
9. Loop             → Return to step 2

State Schema

Track across iterations. Store in task descriptions for compaction survival.

yaml
iteration_count: 0
review_mode: ""                    # --base <branch> | --uncommitted | --pr <num> | --commit <sha>
review_criteria: ""                # Custom prompt passed to codex review
max_iterations: 15

# Reasoning level tracking
reasoning_level: "low"             # Current: low | medium | high | xhigh
reasoning_strategy: "progressive"  # progressive | fixed
parallel_review_count: 3           # -n flag (default 3) - how many reviews to run in parallel

# Level history (for reporting)
level_history:
  low:    { reviews: 0, findings: 0, fixed_point: false }
  medium: { reviews: 0, findings: 0, fixed_point: false }
  high:   { reviews: 0, findings: 0, fixed_point: false }
  xhigh:  { reviews: 0, findings: 0, fixed_point: false }

issue_tracker: []

Phase: Initialize

Do:

  • Detect base branch properly (check for Graphite stack first)
  • Parse review mode from args
  • Initialize state and create tracking task

Don't:

  • ❌ Assume master/main is the base — check for stack parent first
  • ❌ Skip base branch detection — wrong base = useless review

On activation:

  1. Determine review mode from args:

    • No args or directory → --uncommitted (review working changes)
    • --base <branch> → review changes vs branch
    • --pr <num>--base against PR's target branch
    • --commit <sha> → review specific commit
  2. Detect base branch:

    bash
    # Check if in a Graphite stack
    gt ls 2>/dev/null
    
    • If in a stack, the base is the parent branch, not master
    • Use gt log --oneline or check PR target to find actual base
    • Only the bottom of a stack targets master/main
  3. Parse optional criteria (custom review prompt)

  4. Initialize state, create tracking task

Base branch detection:

code
Stack example (gt ls):
  ◉ feature-c  ← current (base: feature-b)
  ◉ feature-b  (base: feature-a)
  ◉ feature-a  (base: master)
  ◉ master

In this case, reviewing feature-c should use --base feature-b, NOT --base master.

Args examples:

bash
# Default: progressive low → xhigh, 3 parallel reviews per level
/loop-codex-review                          # --uncommitted, full climb
/loop-codex-review --base master            # Review vs master, full climb

# Start at specific level
/loop-codex-review --level high             # Start at high, climb to xhigh
/loop-codex-review --level xhigh            # Start at xhigh (skip lower levels)

# Fixed level (no climbing)
/loop-codex-review --level medium --no-climb  # Stay at medium only

# Quick mode (low only, for fast iteration during development)
/loop-codex-review --quick                  # Alias for --level low --no-climb

# Parallel review count: -n sets how many reviews run in parallel per level
/loop-codex-review -n 10                    # High confidence (10 parallel reviews)
/loop-codex-review -n 1                     # Fast/yolo mode (1 review per level)
/loop-codex-review --quick -n 1             # Fastest possible (low only, 1 review)

# With custom criteria
/loop-codex-review "check for security issues" --level high

# Auto-detect base from Graphite stack
/loop-codex-review --base auto              # Uses gt to find parent branch

The -n parameter: Controls how many reviews run in parallel at each level. All n must be clean to advance. Default is 3. Higher values = more diverse perspectives = higher confidence. Max recommended is 10.

Auto-detection logic:

  1. If gt available → check parent with gt log --oneline -n 1 or parse gt ls
  2. Else if in PR → use gh pr view --json baseRefName
  3. Else → fall back to master/main

Phase: Review (THE KEY PART)

This runs the actual codex review CLI command — NOT a Claude agent.

Do:

  • Use Bash tool directly with run_in_background: true
  • Launch all n reviews in a single message (parallel)
  • Always set -c model_reasoning_effort explicitly
  • Record all task IDs for polling later

Don't:

  • ❌ Use Task agents for review — they interpret prompts unpredictably (e.g., tail -f blocking forever)
  • ❌ Run reviews sequentially — always parallel
  • ❌ Forget -c model_reasoning_effort — Codex defaults are unpredictable
  • ❌ Use tail -f to check output — it blocks forever; use tail -n or cat

Example: Launch n Parallel Reviews

code
# If n=3 (default), launch 3 in a single message:
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1", run_in_background: true, description: "Codex review 1/3 (low)")
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1", run_in_background: true, description: "Codex review 2/3 (low)")
Bash(command: "codex review --base master -c model_reasoning_effort=\"low\" 2>&1", run_in_background: true, description: "Codex review 3/3 (low)")

# If n=10, launch 10 in a single message:
Bash(command: "...", run_in_background: true, description: "Codex review 1/10 (low)")
# ... repeat for all n reviews

Each call returns a task_id and output_file path. Record these for polling.

Fixed point = all n clean. If ANY review has findings, address them and re-run all n reviews at this level.

Command Construction

ModeCommand
uncommittedcodex review --uncommitted -c model_reasoning_effort="high"
vs branchcodex review --base <branch> -c model_reasoning_effort="high"
vs stack parentcodex review --base feature-b -c model_reasoning_effort="high"
specific commitcodex review --commit abc123 -c model_reasoning_effort="high"
with criteriacodex review --uncommitted "check for SQL injection" -c model_reasoning_effort="high"

Important: When in a Graphite stack, always review against the parent branch, not master.

Polling: Use cat or tail -n (NOT tail -f) to check output files. See "Handling Zombie Tasks" section.

Phase: Parse Output

Do:

  • Extract all findings from codex review output
  • Parse into issue tracker format
  • Record the reasoning level that found each issue

Don't:

  • ❌ Skip findings because they seem minor — every finding gets tracked
  • ❌ Combine multiple findings into one — each gets its own ID

Codex review outputs markdown with findings. Parse into tracker:

markdown
| ID | File | Line | Severity | Description | Status | Iter | Level |
|:--:|:-----|:----:|:--------:|:------------|:------:|:----:|:-----:|
| CR-001 | src/auth.js | 42 | major | SQL injection | open | I1 | high |

Evaluate n Parallel Results

code
results = [review_1, review_2, ..., review_n]

if ALL n results are clean:
    # Fixed point at this level!
    if reasoning_level == "xhigh":
        → DONE (full fixed point reached)
    else:
        → Advance to next reasoning level
else:
    # ANY review has findings
    → Merge all findings into tracker, proceed to address phase
    → After addressing, re-run all n reviews at this level

Verification of Findings

Do:

  • Verify each finding before addressing (especially at lower reasoning levels)
  • Ask: real bug, false positive, or design tradeoff?
  • Triage using this table:
Finding TypeResolution
Real bugFix the code
False positiveAdd comments or refactor until the intent is obvious
Design tradeoffDocument the rationale in code comments
UnclearResearch before deciding

Don't:

  • ❌ Address without verifying first — lower reasoning levels have more false positives
  • ❌ Dismiss findings without improving code — every finding = code change
  • ❌ Blame the reviewer for misunderstanding — if an LLM gets confused, a human will too

Critical insight: False positives are documentation bugs.

When a reviewer misunderstands your code, the code is unclear. If an LLM gets confused, a tired human will too. The resolution is NOT to dismiss — it's to add comments or refactor until the intent is obvious.

Example: A reviewer flags an empty catch block as "swallowing errors." But you're intentionally ignoring that specific error. The resolution isn't to dismiss — it's to add a comment:

javascript
} catch (e) {
  // Intentionally ignored: retries handle this upstream
}

Now the next reviewer (human or LLM) won't raise the same concern. The false positive becomes impossible.

Synthesize Before Addressing

⚠️ Always zoom out before addressing any finding.

Reviewers do deep analysis but output terse summaries. A finding that looks like a one-line change often touches code with multiple exit paths, callers, and implicit contracts. Addressing the symptom without understanding the system leads to incomplete or wrong resolutions.

This step is not optional, and it's not just for "complex" findings. Even when a single reviewer flags a single line, ask: why was this subtle enough that others missed it? What else in this area might have similar issues?

The protocol:

  1. Read the full context — Not just the flagged line. Read the entire function, its callers, and sibling code. The summary is a pointer; the truth is in the source.

  2. Map the system — Trace the relevant paths:

    • All exit points from the function
    • All callers and call sites
    • All reads and writes of affected state
  3. Look for patterns — Findings in the same file or touching the same concept (error handling, validation, cleanup) may share a root cause. A single finding may reveal a pattern repeated elsewhere.

  4. Ask the hard questions:

    • What contract should this code uphold?
    • Does every path honor that contract?
    • What would a surface-level fix miss?
    • Is there a structural issue underneath?
  5. Challenge yourself — "Is this my best effort? What haven't I considered?"

The goal is to reconstruct the full picture before acting. Understand the system, then address holistically.

Phase: Address (Claude Agents)

Do:

  • Check exit conditions before spawning any agents
  • Ask user for restart strategy when findings exist
  • Spawn agents in parallel with run_in_background: true
  • Group findings by file when sensible

Don't:

  • ❌ Skip exit check — you might already be done
  • ❌ Address findings without user input on restart strategy
  • ❌ Run address agents sequentially — always parallel

Exit check first:

code
if all_n_clean AND reasoning_level == "xhigh":
    → Done (full fixed point reached)
if all_n_clean:
    → Advance to next reasoning level
if iteration_count >= max_iterations:
    → Ask user how to proceed

When Findings Exist: Ask User for Strategy

Use AskUserQuestion to let user choose restart strategy:

code
"Found {count} findings at {level} reasoning. After addressing, how should we verify?"

Options:
1. "Re-review at [current level]" (recommended) - Verify resolution at same depth
2. "Restart from low" - Full re-climb, maximum confidence
3. "Drop one level and re-climb" - Balance of speed and thoroughness
4. "Skip to next level" - Trust the resolution, continue climbing

Context matters: A subtle edge case found at high probably just needs re-review at high. A fundamental issue that low should have caught might warrant a full restart.

Spawn Claude Address Agents

Spawn address agents in parallel via Task tool:

  • One agent per finding (or grouped by file)
  • run_in_background: true for parallel execution
  • Agent prompt includes finding details from Codex's review
code
Task(
  description: "Address CR-001: SQL injection",
  prompt: "Address the SQL injection finding from code review...",
  subagent_type: "general-purpose",
  run_in_background: true
)

Phase: Verify

Do:

  • Run tests (make test or equivalent)
  • Verify files were actually modified
  • Update issue tracker: addressingfixed or clarified

Don't:

  • ❌ Skip test verification
  • ❌ Proceed if tests fail — address test failures first

Phase: Human Approval

Do:

  • Present detailed summary with full context
  • Use AskUserQuestion with clear options
  • Wait for explicit approval before committing

Don't:

  • ❌ Skip this checkpoint — human approval is mandatory
  • ❌ Commit without explicit "Approve and commit" response

Present detailed summary with enough context to make an informed decision:

markdown
## Iteration {N} — Detailed Review

### CR-001: [Short title] (severity)

**The Finding:**
[2-3 sentences explaining what the reviewer flagged, where it occurs, and why it matters.]

**The Resolution:**
[What changed. For bugs: the fix. For unclear code: the clarifying comment or refactor.]

**Impact:** [One line on what this improves]

---

### CR-002: [Short title] (severity)

**The Finding:**
[Same format...]

**The Resolution:**
[Same format...]

**Impact:** [...]

---

### Summary

| ID | File | Change |
|----|------|--------|
| CR-001 | src/auth.js | String concat → parameterized query |
| CR-002 | src/api.ts | Added comment explaining intentional behavior |

### Resolutions
- **CR-001**: Fixed SQL injection via parameterized query
- **CR-002**: Added comment clarifying why null check is unnecessary here

### Verification
- [x] Tests passing (N/N)
- [x] Files modified: src/auth.js, src/api.ts

Key principle: The human needs enough context to understand what was flagged, why it matters, and how Claude addressed it — without having to dig through logs or diffs.

AskUserQuestion with options:

  1. "Approve and commit" — commit changes, continue to next review
  2. "View full diff" — show git diff, then re-ask
  3. "Request changes" — user specifies modifications
  4. "Abort" — exit loop, keep changes uncommitted

Phase: Commit

Do:

  • Commit only after explicit human approval
  • Include all resolved findings in commit message
  • Loop back to Phase: Review after committing

Don't:

  • ❌ Commit without human approval
  • ❌ Commit before addressing all findings from current review round

After human approval:

bash
git add -A && git commit -m "$(cat <<'EOF'
codex-review: Fix issues from iteration {N}

Issues resolved:
- CR-001: SQL injection in auth.js (major)

Reviewed by: OpenAI Codex
Fixed by: Claude

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
EOF
)"

Then loop back to Phase: Review.

Fixed Point

Do:

  • Require ALL n reviews clean to declare fixed point
  • Climb all the way to xhigh (default behavior)
  • Re-run all n reviews after addressing any finding

Don't:

  • ❌ Trust low/medium clean reviews as "done" — always climb to at least high
  • ❌ Stop at first fixed point — default is full climb to xhigh
  • ❌ Declare fixed point if ANY review has findings

The True Definition

A true fixed point requires BOTH:

  1. No real bugs — the code is correct
  2. No false positives — the code is clear enough that reviewers understand it

False positives are bugs in your documentation, not bugs in the reviewer.

If 1 in 10 reviewers misunderstands your code, that's a 10% confusion rate. Address it by adding comments until the confusion rate hits 0%. Don't dismiss — clarify, then re-run to verify.

Per-Level Fixed Point

When all n parallel reviews return clean at any level:

code
All n reviews at [level] found nothing.
Fixed point at [level]. Advancing to [next level]...

Full Fixed Point

When all n reviews return clean at xhigh:

code
┌─────────────────────────────────────────────────────────┐
│  FULL FIXED POINT REACHED                               │
├─────────────────────────────────────────────────────────┤
│  low:    n/n clean ✓                                    │
│  medium: n/n clean ✓                                    │
│  high:   n/n clean ✓                                    │
│  xhigh:  n/n clean ✓                                    │
├─────────────────────────────────────────────────────────┤
│  Total reviews: 4n* |  Findings addressed: X            │
│  Code has been validated at all reasoning depths.       │
└─────────────────────────────────────────────────────────┘

*If started from a higher level (e.g., --level high), total is fewer.

Report final summary with level history and exit.

Issue Tracker Format

Maintain throughout session:

code
┌────────┬─────────────┬──────┬──────────┬─────────────────────────────────┬──────────┬───────┬───────┐
│ ID     │ File        │ Line │ Severity │ Description                     │ Status   │ Iter  │ Level │
├────────┼─────────────┼──────┼──────────┼─────────────────────────────────┼──────────┼───────┼───────┤
│ CR-001 │ src/auth.js │ 42   │ major    │ SQL injection                   │ fixed    │ I1    │ high  │
│ CR-002 │ src/api.ts  │ 108  │ minor    │ Missing null check              │ fixed    │ I1    │ high  │
│ CR-003 │ src/util.js │ 15   │ style    │ Unused import (false positive)  │ clarified│ I2    │ xhigh │
└────────┴─────────────┴──────┴──────────┴─────────────────────────────────┴──────────┴───────┴───────┘

Severities: critical | major | minor | style Statuses: open | addressing | fixed | clarified

Status transitions:

  • open → when finding is first recorded
  • addressing → when an agent is actively working on it
  • fixed → real bug was fixed in code
  • clarified → false positive addressed with comments/refactoring

Don't:

  • ❌ Use "wontfix" status — it doesn't exist
  • ❌ Leave any finding unaddressed — every finding = code improvement

See Core Philosophy: every finding results in code change (fix OR clarify).

Resumption (Post-Compaction)

  1. Run TaskList to find review loop task
  2. Read task description for persisted state
  3. Check for running background Bash (codex review) or Task agents
  4. Resume from appropriate phase

Handling Zombie Tasks

⚠️ Background task notifications are unreliable. ~50% of notifications are lost when you're mid-response. Don't trust them.

The Pattern:

  1. Record IDs when launching — Note the task IDs from each Bash() call
  2. Wait briefly — Give tasks time to run (~3m for low, ~5m for medium, ~8-10m for high, ~12-20m for xhigh)
  3. Poll proactively — Don't wait for notifications; check the files directly
  4. If user says "tasks are done" — Trust them and poll immediately

Polling Recipe:

bash
# The output_file path is returned when you launch each task
# Check if task is complete (look for final JSON line with result)
tail -1 <output_file>

# Batch check all tasks (using the paths returned at launch time)
for f in /path/to/task1.output /path/to/task2.output; do echo "=== $f ==="; tail -1 "$f"; done

⚠️ NEVER use tail -f — it blocks forever. Always use cat or tail -n which terminate after reading.

Key insight: Treat notifications as hints, not guarantees. When in doubt, poll.

Contradictory Findings

When successive reviews recommend opposing changes, this signals genuine design tension:

  1. Pause — Don't implement the latest suggestion reflexively
  2. Enumerate solutions — Map all approaches with their tradeoffs
  3. Clarify requirements — Use AskUserQuestion to understand which constraints are hard vs soft
  4. Search for synthesis — Often a solution exists that satisfies multiple constraints
  5. Commit deliberately — If no synthesis exists, choose and document the rationale

Contradictory findings usually indicate underspecified requirements, not wrong reviews.

Quick Reference: Don'ts

Pre-flight checklist. Details are inline in each section above.

SectionDon't
InitializeAssume master is base, skip base branch detection
ReviewUse Task agents, run sequentially, forget -c model_reasoning_effort, use tail -f
Parse OutputSkip findings because they seem minor, combine multiple findings into one
Verification of FindingsAddress without verifying, dismiss without improving code, blame reviewer
AddressSkip exit check, address without user strategy input, run agents sequentially
VerifySkip tests, proceed if tests fail
ApprovalSkip checkpoint, commit without explicit approval
CommitCommit without approval, commit before addressing all findings
Fixed PointTrust low/medium as done, stop at first fixed point, declare fixed point if ANY review has findings
Issue TrackerUse "wontfix" status, leave findings unaddressed

Enter loop-codex-review mode now. Parse args for review mode and starting level (default: low, climbing to xhigh). Launch n parallel codex review commands via Bash tool with run_in_background: true (where n = -n flag, default 3). All n must be clean to advance to next level. Always set -c model_reasoning_effort explicitly. Do NOT do the review yourself — delegate to Codex via the CLI.