Agent Review Panel

4–6 AI reviewers independently evaluate your code, plan, or docs, debate each other's findings, then a judge resolves disagreements. A Claude Code plugin that orchestrates structured multi-stance review. Each run costs roughly $3–$20 in Opus tokens and takes 6–15 minutes — built for high-stakes reviews, not routine code review.

▶️ Play the interactive version — pause, replay, jump to any stage. Your code/plan journeys Gather → Review → Debate → Verify → Adjudicate → Report. (static pipeline diagram)

Runs only on Claude Code surfaces (CLI, IDE extension, Desktop "Code" tab) and the Claude Agent SDK. Does not work on claude.ai web chat or the raw Anthropic API — see Surfaces & Requirements.

Quick Start

claude plugin marketplace add wan-huiyan/agent-review-panel
claude plugin install roundtable@agent-review-panel

Restart your Claude Code session — skills load at session start.

Already inside a Claude Code session? Use the slash-command form:

/plugin marketplace add wan-huiyan/agent-review-panel
/plugin install roundtable@agent-review-panel

Both forms have the same effect: use the shell form (claude plugin …) from the terminal and the REPL form (/plugin …) from inside Claude Code.

Verify it loaded. In a fresh Claude Code session, type /roundtable:agent-review-panel with no arguments. The skill should introduce itself and ask what to review. If you get unknown command, see Troubleshooting. Migrating from an older install handle? See MIGRATION.md.

⚡ Recommended — add VoltAgent specialists. Reviews get noticeably sharper when VoltAgent specialist agents are installed: the panel auto-upgrades each reviewer to a real domain specialist (e.g. Security Auditor → voltagent-qa-sec:security-auditor) instead of a generic persona. Not installed? It falls back gracefully — nothing breaks.

claude plugin marketplace add VoltAgent/awesome-claude-code-subagents
claude plugin install voltagent-qa-sec@voltagent-subagents   # backs the core code-review personas

Add voltagent-lang, voltagent-data-ai, voltagent-infra, … as you need them (10 families, 130+ agents). The /plugin … REPL form works too; either way, installs take effect in the next session. Why it's worth it → Sharper reviews with VoltAgent. (VoltAgent installs through Claude Code's plugin system.)

Use:

> Review this implementation plan from multiple perspectives: docs/my_plan.md

> /roundtable:agent-review-panel

Natural-language invocation also works ("red team this", "stress-test this design", "get a panel review of …").

What you get — three files written to your current working directory:

review_panel_report.md — executive summary, consensus, judge-resolved disagreements, action items
review_panel_process.md — verbatim "director's cut" of every reviewer's output
review_panel_report.html — interactive dashboard with expandable issue cards, charts, and a panel gallery
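
A quick way to confirm that a run finished is to check that all three files exist. The following is a minimal Python sketch, not part of the plugin: the helper name missing_outputs is hypothetical, and only the three filenames come from this README.

```python
from pathlib import Path

# Filenames as listed above; the helper itself is illustrative only.
EXPECTED_OUTPUTS = (
    "review_panel_report.md",
    "review_panel_process.md",
    "review_panel_report.html",
)


def missing_outputs(directory: str = ".") -> list[str]:
    """Return the expected panel output files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in EXPECTED_OUTPUTS if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_outputs()
    print("all outputs present" if not missing else f"missing: {missing}")
```

Run it from the directory where the Claude Code session was started, since that is where the panel writes its files.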

See Sample output below for what a real report looks like.

When to use it — and when not

Use a panel for: high-stakes plans you'd want a second/third/fourth opinion on; security-sensitive code; architecture decisions that are expensive to reverse; docs that have to be right; "I'm too close to this — stress-test it."

Don't use a panel for (use a single review or /review instead): a quick look before pushing; type-error fixes; routine code review; addressing existing PR comments; "what does this code do?"; deployments; writing tests or docs from scratch.

The deciding signal is whether you need multiple independent perspectives; if one opinion would do, you're paying 4–6× more for nothing.

Why a panel, not a single reviewer?

Ask one reviewer "review this" and you get one perspective. It won't argue with itself, catch its own blind spots, or volunteer "I'm not sure about this."

The panel forces the same model to take multiple passes from assigned, divergent stances — Correctness Hawk, Security Auditor, Devil's Advocate, etc. — before they see each other's output. Then they debate, then a separate judge step resolves disputes against verified evidence. You get a deliberation, not a list:

Feasibility Analyst: "The data_available_through hardcoding is minor — it's documented."

Risk Assessor: "Disagree. If stale, the lookforward extends past actual data — model trains on incomplete outcomes — silent false-negative bias."

Feasibility Analyst (Round 2): "Valid point. I upgrade this to IMPORTANT."

Honest caveat: all reviewers are Claude instances running the same base model. This is structured self-critique, not independent verification — unanimous agreement may reflect shared model bias rather than ground truth. The architecture includes anti-groupthink safeguards (blind final scoring, sycophancy detection, correlated-bias warnings) but cannot eliminate the structural limitation. The value is the forced multi-stance discipline plus verification against actual source, not the appearance of disagreement.

Sample output

The panel was run against this very README on 2026-05-14. The interactive HTML dashboard:

Expandable 10-section issue cards (narrative, code evidence with Prism.js syntax highlighting, debate transcripts, judge rulings, fix recommendations, cross-references), deep-linkable, keyboard-navigable, print-friendly. The GIF above links to the actual generated report for this README.

Excerpt from the matching review_panel_report.md:

# Review Panel Report — README.md

**Verdict:** REVISE — substantial edit needed | **Confidence:** Medium-High
**Score:** 5/10 | **Mode:** Exhaustive (pure documentation)

## Executive Summary

The README is factually careful where it counts — install commands, marketplace
handles, slash commands, the test-count claim, and spot-checked anchors all verify
cleanly against the repo. But three problems pull the score to 5/10: the
version/release story is incoherent and actively misleads every current
user, the document is roughly a third too long, and it never shows what a
review actually produces.

## Consensus Points (judge-confirmed)

- Version/release story is broken — all four reviewers, independently.
  package.json declares 3.3.0; the only git tags are up to v3.1.0. The
  "Updating" section tells users to verify against "the latest GitHub
  release" — which resolves to v3.1.0, so every correctly-installed
  v3.3.0 user concludes their install is wrong.  [VERIFIED][CONSENSUS]

## Action Items

| # | Severity | Action                                                       |
|---|----------|--------------------------------------------------------------|
| 1 | P1 [VERIFIED][CONSENSUS] | Fix the version/release story.            |
| 3 | P1 [CONSENSUS]           | Cut ~⅓ length; split audiences.           |
| 4 | P1 [VERIFIED]            | Add a text sample of a real report.       |

You're reading the rewrite that lands those findings. The full artifacts live at:

docs/reviews/2026-05-14-readme/review_panel_report.md — the full markdown report (13 action items)
docs/reviews/2026-05-14-readme/review_panel_report.html — the interactive HTML dashboard
docs/reviews/2026-05-14-readme/review_panel_process.md — verbatim process log with all four reviewers' raw output

Sharper reviews with VoltAgent specialists

By default every reviewer is a generic Claude agent wearing a persona prompt ("act as a Security Auditor"). That works — but if you install VoltAgent's specialist agents, the panel upgrades each persona to a matching domain specialist whose expertise is built into its own system prompt: a real Security Auditor, Database Optimizer, or SRE instead of a generalist asked to role-play one.

Automatic, zero-config. During setup (Phase 1) the panel scans for installed voltagent-* agents and routes personas to them. No specialist installed for a given role? That reviewer falls back to a generic persona — nothing to configure, nothing breaks.
130+ specialists across 10 families — qa-sec (code review, security, performance, chaos), lang (20+ language experts), data-ai, infra, core-dev, and more. Content signals auto-add the right one: SQL → a database optimizer, Terraform → an IaC engineer, React → a frontend specialist. v3.4 added 30 such signal→specialist mappings, roughly doubling coverage.
Install what you review. voltagent-qa-sec backs the core code-review personas; add voltagent-lang / voltagent-data-ai / voltagent-infra for language-, ML-, or infra-heavy work. Commands are in Quick Start.
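
The upgrade-or-fall-back behaviour described above can be pictured as a lookup with a default. This is a sketch under stated assumptions, not the panel's actual routing code: the SPECIALISTS table, the route_persona helper, and the generic: prefix are all hypothetical, and only the Security Auditor → voltagent-qa-sec:security-auditor pairing is taken from this README.

```python
# Hypothetical persona -> specialist table. Only the Security Auditor entry
# appears in this README; a real run discovers installed agents in Phase 1.
SPECIALISTS = {
    "Security Auditor": "voltagent-qa-sec:security-auditor",
}


def route_persona(persona: str, installed: set[str]) -> str:
    """Pick the matching specialist if it is installed, else a generic persona."""
    specialist = SPECIALISTS.get(persona)
    if specialist is not None and specialist in installed:
        return specialist
    # The "generic:" prefix is an illustrative label, not a real agent name.
    return f"generic:{persona}"
```

The point of the design is that a missing specialist never blocks a run: the lookup always returns something a reviewer can be launched with.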

VoltAgent specialists install through Claude Code's plugin system — the same Claude Code surfaces the panel itself runs on (CLI / IDE / Desktop "Code" tab / Agent SDK). Codex isn't a supported surface for the panel: it needs Claude Code's Agent tool to spawn reviewers in parallel — see Surfaces & Requirements.

Surfaces & Requirements

This plugin needs the Claude Code Agent tool for parallel subagent spawning, local-filesystem access for output files, and a plugin loader.

Works ✅

CLI — claude in your terminal
VS Code extension — Claude Code extension
JetBrains IDE extension — IntelliJ, PyCharm, WebStorm, GoLand, Rider, etc.
Claude Desktop app → Code tab — the dedicated "Code" tab inside the Desktop app (docs)
Claude Agent SDK — load via options.plugins: [{ type: "local", path: "./agent-review-panel" }] (docs)

Does not work ❌

claude.ai web chat and Claude Desktop regular chat tabs — no /plugin marketplace; Agent Skills there can't spawn parallel subagents or write the output files
Anthropic Messages API called directly (without the Agent SDK) — no plugin loader, no subagent orchestration, no filesystem. Use the Agent SDK entry above.

Prerequisites in one place:

Claude Code v1.0+ on a supported surface, or the Claude Agent SDK
A Claude Pro / Max subscription or API access (Opus is the reviewer model)
Node ≥18 (only needed if running npm test or developing locally)
Recommended: VoltAgent specialist agents — the panel auto-upgrades reviewers to domain specialists (why it helps)

Don't have Claude Code yet? Install it from claude.ai/code, then come back to Quick Start.

Marketplace install (recommended)

Install commands are in Quick Start. The marketplace install handles caching, version tracking, and trigger-phrase activation in one step.

Command format: @<marketplace-name>, not @<repo-name>. The marketplace name is agent-review-panel (defined in .claude-plugin/marketplace.json); the plugin install name inside it is roundtable. Hence roundtable@agent-review-panel. If you installed under an older marketplace name, see MIGRATION.md.
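
To make the handle format concrete, here is a small Python sketch that splits a handle into its two parts. The split_handle helper is hypothetical and exists only to illustrate the <plugin>@<marketplace> shape; Claude Code does its own parsing.

```python
def split_handle(handle: str) -> tuple[str, str]:
    """Split '<plugin>@<marketplace>' into (plugin, marketplace)."""
    plugin, sep, marketplace = handle.partition("@")
    if not sep or not plugin or not marketplace:
        raise ValueError(f"expected <plugin>@<marketplace>, got {handle!r}")
    return plugin, marketplace


if __name__ == "__main__":
    print(split_handle("roundtable@agent-review-panel"))
```

A bare handle such as @agent-review-panel has no plugin part, so this sketch rejects it.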

Why the marketplace path? The repo ships .claude-plugin/marketplace.json + .claude-plugin/plugin.json manifests that Claude Code reads to register the plugin. The marketplace install reads those manifests and wires everything up in one step. Manual clone works but bypasses the manifests.

Manual clone

For local development, forking, or air-gapped environments. Do NOT clone into ~/.claude/skills/ — that path shadows marketplace installs.

git clone https://github.com/wan-huiyan/agent-review-panel.git ~/projects/agent-review-panel-dev
claude --plugin-dir ~/projects/agent-review-panel-dev

Cursor (untested by maintainers)

Community-contributed starting points; see Contributing if you want to help.

Per-project rule:

mkdir -p .cursor/rules
# Create .cursor/rules/agent-review-panel.mdc with the content of SKILL.md
# Add frontmatter: alwaysApply: true

Manual global install:

git clone https://github.com/wan-huiyan/agent-review-panel.git ~/.cursor/skills/agent-review-panel

The core pattern is straightforward: spawn one subagent per reviewer in Phase 3, collect the results, run the debate in Phase 5, then run verification and the judge. PRs welcome.

How It Works

A run executes 16 top-level phases, the last of which (Phase 16, the multi-run merge) is optional, plus two verification gates (13.5 and 14.5). The numbered phase table:

Stage	Phase	Action
Gather	1	Setup — scan sibling dirs, trace references, discover safeguards, detect signals, select personas
	2	Data Flow Trace (code only) — trace critical path(s), document schemas at each function boundary, flag composition/seam bugs
Review	3	Independent Review — 4–6 reviewers evaluate in parallel (no cross-talk)
	4	Private Reflection — each reviewer re-reads and rates own confidence
Debate	5	Adversarial Debate (1–3 rounds) — reviewers engage + find new issues
	6	Round Summarization — distill resolved/unresolved points between rounds
	7	Blind Final — each reviewer gives final score independently
Verify	8	Completeness Audit — dedicated agent scans for what the panel missed
	9	Verify Commands — run reviewer grep/read commands for P0/P1 findings (advisory)
	10	Claim Verification — verify all line-number citations against source
	11	Severity Verification — read actual code for every P0/P1; downgrade if overstated; web-verify external domain claims
	12	Tier Assignment — confidence-based draft → judge-advised refinement per dispute
	13	Targeted Verification — persona-matched agents investigate each dispute point
	13.5	Pre-Judge Verification Gate — orchestrator confirms all phase outputs exist on disk before launching the judge
Adjudicate	14	Supreme Judge — Opus arbitrates everything including verification-round evidence
	14.5	Post-Judge Verification Gate — re-verifies judge-introduced P0/P1 against ground truth (catches judge hallucinations)
Output	15	Triple output: Primary Report (`.md`) + Process History (`.md`) + Expandable-card Dashboard (`.html`)
Merge	16	Multi-Run Merge (optional) — deduplicate findings across runs, score stability, resolve judge divergence

Phase 15 has three sequential sub-steps (15.1 → 15.2 → 15.3) so the HTML agent can read the already-written markdown and process files from disk.
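
The Phase 13.5 gate in the table above amounts to an existence check on earlier phase outputs before the judge starts. A minimal Python sketch of that idea follows. It rests on assumptions: the function name, the file names you pass in, and the treatment of empty files as missing are all hypothetical, and the real gate is specified in SKILL.md.

```python
from pathlib import Path


def pre_judge_gate(state_dir: str, expected_files: list[str]) -> list[str]:
    """Return expected phase outputs that are missing or empty under `state_dir`."""
    base = Path(state_dir)
    missing = []
    for name in expected_files:
        path = base / name
        # Treating a zero-byte file as missing is an assumption of this sketch.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

An empty result means the judge can be launched; a non-empty one names the phases whose output would need to be recovered first.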

Choosing a review mode

This skill is the full adversarial panel — its defining feature is the Phase 5–7 debate, where reviewers cross-examine each other before a judge rules. Debate is expensive (sequential cross-talk) and only pays off when reviewers would genuinely change each other's verdicts. Pick deliberately:

You want	Use	Debate?
Fast eyes on a tiny PR	`/code-review`, single-persona review	no
Independent parallel lenses on small / autonomous multi-PR work	streamlined panel (no-debate)	no, by design
Adversarial tradeoff, high-stakes gating, a verdict debate would change (security vs perf, "is this P0 real", merge go/no-go)	this skill (full panel) — invoke as a skill, not a workflow	yes

Debate lives in the skill's Agent-tool orchestration. Running a review as a Workflow / ultracode task routes it through a parallel fan-out engine (agents never see each other) whose canonical shape is "find → verify → judge" — structurally debate-less. If a run produces no debate, the report now stamps a loud [NO-DEBATE] banner and caps confidence at Medium, so a debate-less review is never silent. To get the panel's depth under a Workflow, author an explicit Debate phase (see the debate-in-Workflow recipe in SKILL.md).
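
The debate-less rule can be read as a simple cap. The sketch below illustrates it in Python under assumptions: the three-level scale follows the judge's High / Medium / Low confidence field mentioned under Research foundations, intermediate values such as Medium-High are not handled, and the function name is hypothetical.

```python
# Ordered from least to most confident; an assumption of this sketch.
CONFIDENCE_ORDER = ["Low", "Medium", "High"]


def apply_no_debate_policy(confidence: str, debate_rounds: int) -> tuple[str, str]:
    """Return (banner, confidence) after applying the debate-less cap."""
    if debate_rounds > 0:
        return "", confidence
    # Keep whichever of the two values ranks lower, so High becomes Medium
    # while Low stays Low.
    capped = min(confidence, "Medium", key=CONFIDENCE_ORDER.index)
    return "[NO-DEBATE]", capped
```

A run with at least one debate round passes through unchanged; only a run with zero rounds gets the banner and the cap.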

Features

Review process

4–6 reviewers with distinct personas evaluate in parallel, then debate across 1–3 rounds
Auto-selects personas from content type (code, plan, docs, mixed) and 10 technology signal groups (SQL/Data, Auth/Security, Infrastructure, ML/Statistics, API/Integration, Frontend, Cost, Pipeline, Repo/Data Hygiene, Skill/Docs Portability)
Each reviewer uses a different reasoning strategy (systematic enumeration, adversarial simulation, backward reasoning, etc.)
Auto Precise/Exhaustive mode: code requires line citations; plans allow broader risk identification

Verification layer

Claim verification checks all reviewer citations against actual source code
Severity verification reads the codebase to confirm P0/P1 findings before the judge sees them. External domain claims (product limits, regulatory jurisdiction, API behavior) are automatically web-searched and tagged [WEB-VERIFIED], [WEB-CONTRADICTED], or [WEB-INCONCLUSIVE]
Verification commands: runs read-only grep/cat commands from reviewers to confirm or contradict claims
Defect classification: findings labeled [EXISTING_DEFECT] or [PLAN_RISK] — P0 requires existing defect evidence
Live-State Claim Discipline: findings asserting facts about live infrastructure or runtime state (deployed IAM, running crons, production env vars) must distinguish declarative config from imperative documentation from live describe-class output. Tagged [LIVE-VERIFIED] or [STATIC-INFERENCE] — the latter is capped at P1
Completeness audit: post-debate agent re-reads source line-by-line for what everyone missed
Targeted verification: each unresolved dispute gets a tiered (Light / Standard / Deep) verification agent matched to the claim type
Data Flow Trace (code only, three tiers): a dedicated agent traces data through critical paths before reviewers begin, flagging composition/seam bugs
Post-judge verification gate: re-verifies any P0/P1 the judge introduced (vs. raised by panel) against ground truth; hallucinated findings are demoted and the verdict score is recomputed
Force-opus: every subagent launch passes model: "opus" explicitly to eliminate cross-run reasoning variance
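
Two of the rules in this list are severity caps: a [STATIC-INFERENCE] finding tops out at P1, and P0 needs [EXISTING_DEFECT] evidence. The Python sketch below shows one way to express them. It is illustrative only: the function name is hypothetical, and downgrading an unsupported P0 to exactly P1 is an assumption of this sketch rather than documented behaviour.

```python
VALID_SEVERITIES = ("P0", "P1", "P2", "P3")


def cap_severity(severity: str, tags: set[str]) -> str:
    """Apply the STATIC-INFERENCE cap and the P0 evidence requirement."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if severity == "P0":
        static_only = "STATIC-INFERENCE" in tags
        no_defect_evidence = "EXISTING_DEFECT" not in tags
        if static_only or no_defect_evidence:
            # Assumed downgrade target; the README only states the cap at P1
            # for STATIC-INFERENCE findings.
            return "P1"
    return severity
```

Findings already at P1 or below pass through unchanged, since neither rule raises a severity.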

Anti-groupthink safeguards

Blind final scoring, private reflection, calibrated skepticism levels (20–60%)
Sycophancy detection intervenes when >50% of position changes lack new evidence
Anti-rhetoric assessment flags position changes driven by eloquence rather than evidence
Judge confidence gating: low-confidence verdicts flag ⚠️ HUMAN REVIEW RECOMMENDED
Correlated-bias warning when all reviewers converge (unanimous agreement is the most dangerous failure mode)
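
The sycophancy threshold above is a ratio test. Here is a minimal Python sketch of it, assuming each position change has already been marked as citing new evidence or not. In the panel that judgment is made by the model, so the boolean field and the PositionChange type are simplifications introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class PositionChange:
    reviewer: str
    cited_new_evidence: bool  # simplification: the real assessment is model-made


def sycophancy_triggered(changes: list[PositionChange]) -> bool:
    """True when more than half of the position changes lack new evidence."""
    if not changes:
        return False
    unsupported = sum(1 for change in changes if not change.cited_new_evidence)
    return unsupported / len(changes) > 0.5
```

Note the strict inequality: exactly half unsupported changes does not trigger the intervention in this sketch.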

Outputs (three files per review)

Primary report (review_panel_report.md): executive summary, consensus, disagreements (with judge rulings), prioritized action items with epistemic labels — see Reading the Report.
Process history (review_panel_process.md): verbatim "director's cut" log of every agent's output, with persona profiles embedded.
Interactive HTML dashboard (review_panel_report.html) with expandable 10-section issue cards — each card opens a nested accordion: 📖 Narrative, 📄 Code Evidence (Prism.js-highlighted), 👥 Raised by, 🔍 Verification Trail, 💬 Debate, ⚖️ Judge Ruling, 🛠️ Fix Recommendation, 🔗 Cross-references, 🏷️ Epistemic Tags, 📊 Prior Runs. Deep-linkable (report.html#issue-A1), keyboard-navigable, print-friendly. Tailwind + Chart.js + Prism.js via CDN.

Advanced

VoltAgent integration — auto-upgrades personas to 130+ domain specialists across 10 families when installed; content signals route each domain (SQL, Terraform, React, …) to its matching expert. See Sharper reviews with VoltAgent
Multi-Run Union Protocol — invoke --runs N or "run 3 times and merge" to execute the panel N times with rotated persona compositions. Phase 16 merges findings by location + bug class, scores stability as [K/N RUNS], resolves judge divergence. Mitigates the ~30% single-run blind spot observed in early consistency analyses
Codebase state check — detects worktree/branch divergence to prevent false "missing code" findings
Tiered knowledge mining (L0 index / L1 summary / L2 full) — scans index lines first, only reads full content for relevant items
Deep research mode — opt-in web research for domain best practices
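
The Multi-Run Union merge described above groups findings by location and bug class, then reports how many runs saw each one. A rough Python sketch of that bookkeeping follows. The dictionary keys (location, bug_class) and the exact-match grouping are assumptions; in the panel, Phase 16 performs the merge with an agent, not with string equality.

```python
from collections import defaultdict


def merge_runs(runs: list[list[dict]]) -> list[dict]:
    """Deduplicate findings across runs and tag each with [K/N RUNS]."""
    seen_in = defaultdict(set)  # key -> indices of the runs that reported it
    first = {}                  # key -> first finding seen with that key
    for run_index, findings in enumerate(runs):
        for finding in findings:
            key = (finding["location"], finding["bug_class"])
            seen_in[key].add(run_index)
            first.setdefault(key, finding)

    total = len(runs)
    merged = []
    for key, finding in first.items():
        count = len(seen_in[key])
        merged.append(
            {**finding, "runs_seen": count, "stability": f"[{count}/{total} RUNS]"}
        )
    # Most stable findings first; ties keep first-seen order.
    merged.sort(key=lambda item: -item["runs_seen"])
    return merged
```

A finding tagged [1/3 RUNS] is not necessarily wrong; it is one that a single run would have had a good chance of missing.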

Usage Examples

> Review this implementation plan from multiple perspectives: docs/my_plan.md

> /roundtable:agent-review-panel

> Get a panel review of the authentication module — I want to stress-test the design

> Red team this deployment strategy

> Have agents debate whether this refactor is worth the complexity

> /roundtable:agent-review-panel deep   # adds web research for domain best practices

> Do a deep review of this ML pipeline

After a panel review of a plan document, integrate findings back into the plan with traceability:

> /roundtable:plan-review-integrator review_panel_report.md docs/my_plan.md

The integrator classifies each finding (apply / defer / reject), edits the plan in place, and produces a per-finding traceability summary.

Configuration / Modes

All modes are LLM-interpreted phrases: the skill's description matches them at invocation. There are no real shell flags; --runs 3 in the table below looks like one, but the model reads it like any other phrase.

Mode	Invoke with	Effect
Default panel	`/roundtable:agent-review-panel` (or any "review panel" phrasing)	4-reviewer panel; auto-personas; auto Precise/Exhaustive
Deep research	`/roundtable:agent-review-panel deep` or "do a deep review of …"	Adds opt-in web research for domain best practices
Multi-run union	`--runs 3` or "run 3 times and merge"	Runs the panel N times with rotated personas; deduplicates findings; scores stability `[K/N RUNS]`
Trace tier	"use Thorough trace" / "use Exhaustive trace"	Data Flow Trace tier (Standard default / Thorough top-3 paths / Exhaustive all paths)
Custom personas	"include a Security Auditor and a Cost Modeler"	Overrides auto-persona detection; panel size still 4–6

Cost & Performance

Content type	Duration	Tokens	Approx $ at current Opus pricing
Docs / README review	~6 min	~150k	~$3
Plan document (small)	~6–8 min	~180k	~$5
Code (4-reviewer, auto Precise)	~8–10 min	~250k	~$10
Code + Exhaustive Data Flow Trace	~12–15 min	~350k	~$20
`--runs 3` multi-run union	3× base	3× base	3× base

Cost is driven by: reviewer count (4–6 auto-scales by signal density), content size, debate rounds, deep mode (adds web research), and --runs N (linear multiplier). Pricing assumes Anthropic's published Opus rate at time of run; check Anthropic pricing for current numbers.
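
Because --runs N is a linear multiplier, a rough budget is a one-line calculation. The Python sketch below copies the approximate per-run figures from the table above; the dictionary keys and the function name are hypothetical, and the numbers are estimates, not quotes.

```python
# Approximate per-run figures from the table above: (tokens, USD).
BASE_ESTIMATES = {
    "docs": (150_000, 3),
    "plan_small": (180_000, 5),
    "code": (250_000, 10),
    "code_exhaustive_trace": (350_000, 20),
}


def estimate_run(content_type: str, runs: int = 1) -> tuple[int, int]:
    """Scale the per-run estimate linearly by the number of runs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    tokens, dollars = BASE_ESTIMATES[content_type]
    return tokens * runs, dollars * runs


if __name__ == "__main__":
    tokens, dollars = estimate_run("code", runs=3)
    print(f"~{tokens:,} tokens, ~${dollars}")
```

The sketch ignores the other cost drivers (reviewer count, debate rounds, deep mode), so treat its output as a floor for planning rather than a forecast.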

Known Limitations

Same base model. All reviewers are Claude instances; unanimous agreement may reflect shared model biases. The correlated-bias warning flags this but cannot eliminate it. See Why a panel for the honest framing.
No runtime analysis. The panel reviews static code and documents. It cannot evaluate runtime behavior, production data patterns, or performance under load.
Token cost. Multi-agent review costs more than single-agent. Use for high-stakes reviews, not routine checks.
Temporal reasoning. Despite explicit checks, temporal scope verification (e.g., "excludes Christmas" with multi-year data) remains the hardest class of bug for panels to catch reliably.
Privacy & network. Reviewed content is subject to Claude Code's data-handling policy. The deep-research mode and the Phase 11 web-verification step make outbound HTTPS requests; when you review proprietary code, stay in default mode so both are skipped. The plugin itself sends no telemetry; all three output files are written locally and never transmitted by the plugin.
Output location. All three files are written to your Claude Code session's current working directory and overwrite any prior review_panel_* files there. Filenames are not configurable. Run one panel at a time per directory.
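
Since a new run overwrites the previous review_panel_* files, one workaround is to move the old files aside first. The following Python sketch does that. It is not part of the plugin, and the helper name and the panel_archive_ folder name are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def archive_previous_outputs(directory: str = ".") -> Optional[Path]:
    """Move existing review_panel_* files into a timestamped subfolder.

    Returns the archive folder, or None when there was nothing to move.
    """
    base = Path(directory)
    previous = [p for p in base.glob("review_panel_*") if p.is_file()]
    if not previous:
        return None
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = base / f"panel_archive_{stamp}"  # hypothetical folder name
    target.mkdir(exist_ok=True)
    for path in previous:
        shutil.move(str(path), str(target / path.name))
    return target
```

Run it in the session's working directory before starting the next review.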

Reading the Report

The panel produces three files per run; here's how to read them.

Vocabulary. This README uses these words consistently:

Term	Means
plugin	The marketplace package you install (`roundtable`)
skill	A shipped behavior loaded by Claude Code; this plugin bundles two (`agent-review-panel`, `plan-review-integrator`)
reviewer	One persona in the panel (e.g. Clarity Editor, Devil's Advocate). 4–6 reviewers per run
agent / subagent	The Claude Code launch mechanism used to spawn each reviewer in parallel — not a synonym for "reviewer" or "skill"
panel	The full set of reviewers participating in a single run
judge	The Phase 14 arbiter that ingests all reviewer + verification output and renders the final verdict

Severity rubric. Every finding is tagged with one severity:

P0 — ship-blocker. Affects correctness, data integrity, or security. Must fix before merging/releasing.
P1 — should fix soon. Real defect, real cost, but not a release-blocker.
P2 — nice to have. Stylistic, structural, or future-rot avoidance.
P3 — nit. Optional polish.

Epistemic labels. Each finding is also tagged with how confident the panel is in the underlying evidence:

Label	Meaning
`[VERIFIED]`	Confirmed against source — line citation matches actual code/text
`[CONSENSUS]`	Three or more reviewers raised it independently
`[SINGLE-SOURCE]`	One reviewer raised it, no one refuted
`[DISPUTED]`	Reviewers split on whether it's a real issue
`[UNVERIFIED]`	Stated without a checkable citation; treat with care
`[WEB-VERIFIED]` / `[WEB-CONTRADICTED]` / `[WEB-INCONCLUSIVE]`	External fact confirmed / contradicted / unconfirmed by an authoritative web source
`[CMD_CONFIRMED]` / `[CMD_CONTRADICTED]`	Reviewer's `verification_command` (read-only grep/cat/head) was run and the result confirmed/contradicted the claim
`[LIVE-VERIFIED]`	Live-infrastructure/runtime-state claim backed by `describe`-class command output (`gcloud describe`, `bq show`, `kubectl get`, etc.)
`[STATIC-INFERENCE]`	Live-state claim inferred from source/config/deploy scripts only — capped at P1
`[STATIC-INFERENCE-CONSENSUS]`	Multiple reviewers agreed off the same artifact lines — counts as one source, not independent verification
`[JUDGE-HALLUCINATED]`	The post-judge gate caught a finding the judge introduced that the panel never raised and ground-truth contradicted — heavily discount
`[COMPRESSED]`	Suffix on every action item when the pre-judge gate detected unrecoverable phase loss; the run is lower confidence — re-run for a full report

Defect type. Code/plan reviews additionally label findings as [EXISTING_DEFECT] (bug exists right now) or [PLAN_RISK] (risk only materializes if the plan is implemented as written). P0 severity requires [EXISTING_DEFECT] evidence.

How to read in priority order. Start with the Executive Summary, then the Action Items table (sorted P0 → P3). Use the HTML dashboard's filter bar to narrow by severity or epistemic label. Read the process history (review_panel_process.md) only when you need to see why a finding got a particular ruling. For findings that look wrong, scrutinise the epistemic label first — see TROUBLESHOOTING.md → A finding looks wrong.
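
If you want to triage action items outside the dashboard, the severity cell can be split into a severity and its labels. This Python sketch assumes the cell format shown in the sample report excerpt (for example P1 [VERIFIED][CONSENSUS]); the helper names are hypothetical and the real report layout may vary.

```python
import re


def parse_severity_cell(cell: str) -> tuple[str, list[str]]:
    """Split a cell like 'P1 [VERIFIED][CONSENSUS]' into ('P1', [labels])."""
    match = re.match(r"\s*(P[0-3])\b", cell)
    if match is None:
        raise ValueError(f"no severity found in {cell!r}")
    labels = re.findall(r"\[([^\]]+)\]", cell)
    return match.group(1), labels


def sort_by_severity(cells: list[str]) -> list[str]:
    """Order cells P0 -> P3, keeping the original order within a severity."""
    return sorted(cells, key=lambda cell: parse_severity_cell(cell)[0])
```

With the labels separated out, it is easy to filter for the ones that deserve a second look, such as [UNVERIFIED] or [DISPUTED].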

Bundled skills

The roundtable plugin ships two skills in one install:

Skill	Source	What it does	When to use
`agent-review-panel`	`skills/agent-review-panel/`	Multi-agent adversarial review panel	When you need a structured review of code, plans, docs, or configs
`plan-review-integrator`	`skills/plan-review-integrator/`	Takes review panel output and integrates findings into a plan with traceability	After a panel review of a plan document

Both are activated by natural language or by their slash commands, /roundtable:agent-review-panel and /roundtable:plan-review-integrator respectively.

Why is the plugin named roundtable and not agent-review-panel?

roundtable is a collective noun for the bundle. It also reads more naturally in slash-command form. plan-review-integrator was previously published as a standalone repo at wan-huiyan/plan-review-integrator — that repo is archived in favor of this bundle.

Tests

The test suite (443 tests) uses Node's built-in test runner — zero dependencies, requires Node ≥18:

npm test                    # run all 443 tests
npm run test:triggers       # trigger classification
npm run test:manifest       # manifest consistency + phase/opus enforcement
npm run test:eval-suite     # eval suite integrity
npm run test:report         # report structure validation
npm run test:behavioral     # behavioral assertion framework
npm run test:golden         # golden-file structural snapshots

Tests enforce key invariants: all 16 top-level phases (plus 13.5 / 14.5) present in SKILL.md; every subagent_type: launch co-occurs with model: "opus"; Phase 15.3 spec documents all 10 expandable-card sections; the canonical SKILL.md lives at skills/agent-review-panel/SKILL.md.
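
To give a feel for the force-opus invariant, here is a Python illustration of the idea. The project's actual check lives in tests/manifest-consistency.test.mjs and runs on Node; this sketch is a stand-in that assumes "co-occurs" means "appears in the same blank-line-delimited block", which may not match the real test.

```python
import re


def launches_missing_opus(skill_text: str) -> list[str]:
    """Return blocks that mention subagent_type: without model: "opus".

    Blocks are separated by blank lines; that windowing is an assumption.
    """
    blocks = re.split(r"\n\s*\n", skill_text)
    return [
        block
        for block in blocks
        if "subagent_type:" in block and 'model: "opus"' not in block
    ]
```

An empty list means every launch in the text satisfies the invariant under this sketch's definition.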

Updating

New releases land on main; Claude Code does not auto-pull. From your terminal:

claude plugin marketplace update agent-review-panel
claude plugin update roundtable@agent-review-panel

(REPL form: /plugin marketplace update agent-review-panel then /plugin update roundtable@agent-review-panel.)

Verify the update. The installed version is the directory name under ~/.claude/plugins/cache/agent-review-panel/roundtable/. Compare it against the canonical version in this repo's package.json or the top entry of CHANGELOG.md. Note that GitHub releases may lag main — package.json is the authoritative version source.
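
The comparison described above can be scripted. The Python sketch below assumes that each subdirectory of the cache path is named exactly after a version string and that you have a local copy of the repo's package.json; the function names are hypothetical.

```python
import json
from pathlib import Path


def cached_versions(cache_dir: Path) -> list[str]:
    """List the version-named directories under the plugin cache."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_dir())


def repo_version(package_json: Path) -> str:
    """Read the canonical version from a package.json file."""
    return json.loads(package_json.read_text(encoding="utf-8"))["version"]


def is_up_to_date(cache_dir: Path, package_json: Path) -> bool:
    """True when the canonical version is among the cached versions."""
    return repo_version(package_json) in cached_versions(cache_dir)
```

For a real check, point cache_dir at Path.home() / ".claude/plugins/cache/agent-review-panel/roundtable" and package_json at the file in an up-to-date clone.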

If the update appears to work but old behavior persists, a stale local clone may be shadowing the marketplace install — see TROUBLESHOOTING.md → Old version keeps loading.

Migration

If you installed before v3.0 or under an older marketplace handle (@wan-huiyan-agent-review-panel, @plugin, or the bare @agent-review-panel), see MIGRATION.md for the uninstall + clean-install recipe and a stale-state cleanup prompt you can paste into Claude Code.

Troubleshooting

Common problems and their fixes live in TROUBLESHOOTING.md. The most-hit entries:

After install, slash command not recognized — restart Claude Code first
No output files appeared — they go to the session's cwd
Old version keeps loading — a loose clone is shadowing the marketplace install
A finding looks wrong — read its epistemic label first

Support

Bug reports, questions, and feedback: open a GitHub issue. When filing a bug for a failed or weird review, include the content type (code/plan/docs), approximate size, and (if you can) the review_panel_report.md and review_panel_process.md files — they make repro practical.

Contributing

Contributions welcome. Especially useful:

Cursor adaptation — adapting the Agent tool calls to Cursor's subagent mechanism
New domain checklists — adding signal groups beyond the current 10
Benchmark cases — real-world review scenarios for the eval suite

Dev setup:

git clone https://github.com/wan-huiyan/agent-review-panel.git ~/projects/agent-review-panel-dev
cd ~/projects/agent-review-panel-dev
npm test                            # requires Node ≥18
bash scripts/release-check.sh       # pre-release doc-drift detector

The authoritative skill is skills/agent-review-panel/SKILL.md. Manifest invariants (phase count, force-opus, HTML-card schema, marketplace-name callout marker) are enforced by tests/manifest-consistency.test.mjs and scripts/release-check.sh — both must stay green. See HOW_WE_BUILT_THIS.md for the design journey. Open an issue to discuss before submitting large PRs.

Uninstalling

claude plugin uninstall roundtable@agent-review-panel
claude plugin marketplace remove agent-review-panel

(REPL form: /plugin uninstall roundtable@agent-review-panel and /plugin marketplace remove agent-review-panel.)

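For a fully clean uninstall that removes cached state and any shadow clones, see MIGRATION.md → Stale-state cleanup. If a stray clone exists at ~/.claude/skills/agent-review-panel (the location the Manual clone section warns against), remove it with rm -rf ~/.claude/skills/agent-review-panel. A development clone made with the Manual clone commands is removed by deleting its own directory.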

Research foundations

Agent Review Panel is grounded in 9 research papers and projects on multi-agent debate and evaluation quality.

Mechanisms directly mapped to architecture:

ChatEval (ICLR 2024) → Phase 7 Blind Final — each reviewer commits a final verdict in writing before seeing peers' finals.
MachineSoM (ACL 2024) → Phase 4 Private Reflection — tf/ft/tt/ff conformity tracking makes Phase 5 position flips detectable as signal.
Trust or Escalate (ICLR 2025 Oral) → judge Verdict Confidence: High|Medium|Low field; low-confidence verdicts get ⚠️ HUMAN REVIEW RECOMMENDED.
DMAD (ICLR 2025) → distinct reasoning_strategy per persona injected into Phase 3 prompts.

Papers and projects that informed the design (without 1:1 architecture mapping): AutoGen (multi-agent orchestration patterns — note: AutoGen is a Microsoft research project, not a peer-reviewed paper), Du et al. (ICML 2024), DebateLLM, "Talk Isn't Always Cheap" (ICML 2025) for failure-mode analysis, CONSENSAGENT (ACL 2025) for sycophancy detection.

The cited papers validate multi-agent debate on reasoning benchmarks; this project has not independently benchmarked review quality. The full table with venues, links, and per-paper contributions lives in docs/research-foundations.md. Also inspired by MiroFish for heterogeneous-persona patterns. See ROADMAP.md for the research roadmap.

Version history

See CHANGELOG.md for detailed version history and ROADMAP.md for deferred items.

Version	Highlights
v3.4	VoltAgent catalog expansion — 30 new signal→specialist mappings (130+ agents / 10 families) + drift-detection automation (vendored catalog snapshot, CI gate)
v3.3	Live-State Claim Discipline — cross-cutting rule set for findings asserting facts about live infrastructure or runtime state
v3.2	Post-judge verification gate + Chart.js wrapper-div mandate — Phase 14.5 re-verifies judge-introduced P0/P1 against ground truth
v3.1	Silent-phase-compression fix — file-based subagent state under `state/`, Phase 13.5 Pre-Judge Verification Gate
v3.0	Single-plugin layout (BREAKING) — one plugin (`roundtable`) bundling both skills
v2.16	Plugin layout + marketplace bundle
v2.15	Expandable 10-section issue cards in HTML dashboard; Prism.js syntax highlighting; deep-linking
v2.14	Data Flow Trace (3 tiers); Multi-Run Union Protocol + Phase 16 Merge; force-opus on all launches
v2.13	Persona profiles in process history + Panel Gallery in HTML dashboard
v2.12	Triple output: primary + process history + interactive HTML dashboard
v2.11	Verification round: tiered persona-matched agents per dispute
v2.10	Codebase state check — prevents false "missing code" findings in worktrees
v2.9	VoltAgent specialist agent integration
v2.8	Auto Precise/Exhaustive mode, verification commands, tiered knowledge mining
v2.7	Severity verification, defect classification, temporal scope checks
v2.5	Trust layer: claim verification, epistemic labels, scope disclosure
v2.0	Completeness auditor, new discovery requirement
v1.0	Initial release: multi-agent review with debate and judge

License

MIT — Huiyan Wan. Reports you generate are yours; the HTML dashboard loads Tailwind, Chart.js, and Prism.js from CDN — all MIT-licensed.

Acknowledgements

Inspired by MiroFish — influenced auto-persona detection and persona-matched verification agents
Eval suite improved using schliff
See HOW_WE_BUILT_THIS.md for the design journey

External contributors — thank you!

@okuuva — #30 restructured the plugin to the canonical nested skills layout for Claude Code ≥2.1.112 manifest validation
