Mercer8964/audit-loop

GitHub: Mercer8964/audit-loop

# audit-loop

An AI self-audit skill that works across Claude Code, Codex, and OpenClaw. Before the main agent gives a high-stakes answer (an algorithm, mechanism, number, or correctness claim), it spawns a subagent that independently re-solves the same problem **without seeing the draft**, then spawns a cross-method probe, and mechanically compares the three results. Every design decision is backed by a paper, and the protocol is candid about its limitations.

## The problem

When an AI agent produces a high-stakes answer (algorithm correctness, mechanism design, a numeric estimate, a "this is safe / optimal" claim) and you can't simply run a test, the conventional wisdom is *"just ask the AI to check itself."* Empirical research consistently shows this **fails structurally**, not just occasionally:

- **Refinement-aware bias**: the same content is scored higher when labeled "revised"
- **CoT trust**: judges treat shown reasoning traces as ground truth (false-positive rate up to 90%)
- **Sycophancy**: multi-turn pushback flips answers ~3× more than direct questioning
- **Self-preference / perplexity bias**: models systematically under-flag errors typical of their own training distribution
- **Answer wavering**: multi-round critique echo-chambers instead of converging
- **Intrinsic self-correction** *degrades* reasoning accuracy on average (Huang et al., ICLR 2024)

Critique-of-draft is structurally broken. The structural fix that survives the literature: **independent re-solve, then mechanically compare**.

This mirrors what works in mature human audit domains: reperformance > inquiry in financial audit; replication > peer review; kernel-check > read-the-proof. Step-checking inherits the auditee's blind spots.

## The protocol

**`audit-loop`** (default, budget-balanced; handles ~95% of cases):

1. **Triage.** Empirically testable? Run the test instead. Trivial? Skip. Otherwise continue.
2. **Characterize.** Internally name the CLAIM and its FALSIFICATION SHAPE (what would prove it wrong).
3. **Spawn 1: Independent re-solve.** A subagent solves the original problem from scratch with no view of the draft, no reasoning, no audit framing. Just "solve."
4. **Spawn 2: Cross-method probe.** A different subagent attacks the falsification shape directly: trace on edge inputs, search for a counterexample, recompute via an alternative method.
5. **Mechanical comparison.** Compare draft, re-solve, and probe via documented equivalence rules. Default to "disagreement" when unsure.
6. **Report honestly.** Disagreement surfaces in the audit line; it is never silently resolved by picking one answer.

**`audit-loop-max`** (accuracy-optimal; for security-critical / irreversible / material-consequence decisions):

- 3-5 parallel independent re-solves (varied prompt approaches, cross-family if available)
- 2-3 parallel cross-method probes (different falsification angles)
- Du-et-al multi-agent debate on persistent disagreement
- No spawn cap (typical pool 5-8, up to 14)
- Cross-family verification required where available

## Platforms

| Platform | Default skill | Max-accuracy skill |
|---|---|---|
| Claude Code | `~/.claude/skills/audit-loop/SKILL.md` | `~/.claude/skills/audit-loop-max/SKILL.md` |
| Codex CLI | `~/.agents/skills/audit-loop/SKILL.md` | `~/.agents/skills/audit-loop-max/SKILL.md` |
| OpenClaw | `~/.openclaw/skills/audit-loop/SKILL.md` | `~/.openclaw/skills/audit-loop-max/SKILL.md` |

All three platforms implement the open agent skills standard (frontmatter + markdown body), with platform-specific subagent invocation:

- Claude Code: `Agent` tool with `subagent_type=general-purpose`
- Codex: explicit subagent spawn (optionally via custom `auditor.toml` agent)
- OpenClaw: `sessions_spawn` + `sessions_yield`, `context: "isolated"`

## Installation

    git clone https://github.com/guoyurui138-hue/audit-loop.git
    cd audit-loop

    # Claude Code
    mkdir -p ~/.claude/skills/audit-loop ~/.claude/skills/audit-loop-max
    cp platforms/claude-code/audit-loop/SKILL.md ~/.claude/skills/audit-loop/
    cp platforms/claude-code/audit-loop-max/SKILL.md ~/.claude/skills/audit-loop-max/

    # Codex CLI
    mkdir -p ~/.agents/skills/audit-loop ~/.agents/skills/audit-loop-max
    cp platforms/codex/audit-loop/SKILL.md ~/.agents/skills/audit-loop/
    cp platforms/codex/audit-loop-max/SKILL.md ~/.agents/skills/audit-loop-max/

    # OpenClaw
    mkdir -p ~/.openclaw/skills/audit-loop ~/.openclaw/skills/audit-loop-max
    cp platforms/openclaw/audit-loop/SKILL.md ~/.openclaw/skills/audit-loop/
    cp platforms/openclaw/audit-loop-max/SKILL.md ~/.openclaw/skills/audit-loop-max/

Skills auto-trigger when the agent is about to make a claim matching the description (algorithm correctness, mechanism design, non-empirical numeric estimate, safety/correctness assertion). You can also invoke them manually with `/audit-loop` or `/audit-loop-max`.

## What this does NOT promise

This protocol is **deliberately honest about its limits.** Most "I built an AI agent that improves X by 80%" claims are uncited folklore. This one specifies what it cannot do:

- **Reduces error rate; does not eliminate it.** Same-family verifiers share weights, training data, and blind spots that no protocol can fully escape.
- **Mathematical floor on correlated-verifier accuracy.** For pairwise correlation ρ > 0, ensemble error converges to a positive constant `Φ(Φ⁻¹(1−α)/√ρ)`; adding verifiers cannot drive error to zero (Don't Always Pick, arXiv:2602.08003).
- **Cross-family is bounded.** It eliminates judge bias (preference leakage drops from 28-37% to ~±1.5%) but only halves error correlation (same-family ρ ~0.7-0.8 → cross-family ~0.4-0.5). Capability is a bigger driver of correlation than vendor: two strong models from different vendors can agree on errors at 0.99+ (Correlated Errors, ICML 2025).
- **For empirically testable claims, this is inferior to running the test.** The triage gate exists so you don't substitute theory for measurement.
- **Design-type problems are the most degraded mode.** Failure-mode enumeration shares the same-family blind spots in the worst way: the missing modes are the dangerous ones, and same-family agents miss the same ones the main agent missed.
- **Frontier-novel claims, self-consistent fabrications, and aesthetic judgments** are explicit bypass cases; the protocol reports degraded value in those.

Full limits are documented in each `SKILL.md`.

## Empirical grounding

Every design choice has a paper citation in the `SKILL.md`. Headlines:

**Why re-solve, not critique:**

- McAleese et al., 2024, *LLM Critics Help Catch LLM Bugs* (CriticGPT): https://arxiv.org/abs/2407.00215
- Huang et al., ICLR 2024, *Large Language Models Cannot Self-Correct Reasoning Yet*: https://arxiv.org/abs/2310.01798
- Ye et al., 2024, *Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge*: https://arxiv.org/html/2410.02736v1
- SycEval, *Evaluating LLM Sycophancy*: https://arxiv.org/html/2502.08177v4

**Cross-model error correlation & cross-family limits:**

- Kim et al., ICML 2025, *Correlated Errors in Large Language Models*: https://arxiv.org/abs/2506.07962
- Li et al., ICLR 2026, *Preference Leakage in LLM-as-a-judge*: https://arxiv.org/abs/2502.01534
- *Don't Always Pick the Highest-Performing Model* (ensemble error floor): https://arxiv.org/abs/2602.08003

**Method diversity > sample diversity:**

- Lifshitz et al., 2025, *BoN-MAV: Multi-Agent Verification*: https://arxiv.org/abs/2502.20379
- Naik et al., 2023, *Diversity of Thought*: https://arxiv.org/abs/2310.07088
- Wang et al., 2022, *Self-Consistency*: https://arxiv.org/abs/2203.11171
- Du et al., 2023, *Multi-Agent Debate*: https://arxiv.org/abs/2305.14325

**Negative-prompting / priming failure:**

- Rana, 2026, *Semantic Gravity Wells*: https://arxiv.org/pdf/2601.08070

**Cross-domain audit principles** (reperformance > inquiry, pre-registration, de Bruijn criterion): PCAOB AS 2315; Cochrane Handbook; NTSB Annex 13; Bazerman et al. 2002 on auditor capture; replication-crisis literature on Registered Reports.
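
To make the ensemble error floor cited in this README concrete, here is a minimal numeric sketch. It rests on assumptions that are not taken from the cited paper: verifier errors follow a one-factor Gaussian model with latent correlation `rho`, the ensemble decides by majority vote, and the `1−α` in the quoted expression is read as the single-verifier error rate (the README does not define α). The function name `error_floor` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution, used for Phi and its inverse.
_STD = NormalDist()


def error_floor(single_error: float, rho: float) -> float:
    """Limiting majority-vote error as the number of verifiers grows.

    ASSUMPTION: one-factor Gaussian model of correlated errors, where each
    verifier errs with probability `single_error` and shares a common latent
    factor with weight sqrt(rho). Under that model the limit is
    Phi(Phi^-1(single_error) / sqrt(rho)).
    """
    if not 0.0 < single_error < 0.5:
        raise ValueError("single_error must be in (0, 0.5)")
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    return _STD.cdf(_STD.inv_cdf(single_error) / sqrt(rho))


if __name__ == "__main__":
    # rho values chosen from the same-family and cross-family ranges quoted
    # in the README; the 20% single-verifier error rate is illustrative only.
    for label, rho in (("same-family", 0.75), ("cross-family", 0.45)):
        print(f"{label:12s} rho={rho:.2f} floor={error_floor(0.2, rho):.3f}")
```

Under these assumptions the floor shrinks as `rho` falls but stays strictly positive for any `rho > 0`, which is the point of the limit: lowering correlation helps, while adding more equally correlated verifiers does not.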
## Design choices flagged as "hypothesis, not yet empirically tested"

In the interest of not overclaiming, these choices are marked in the `SKILL.md` as **defensible-but-not-proven**, awaiting head-to-head studies:

- Probe-on-agreement vs probe-on-disagreement allocation (we do probe regardless, but the comparative value is speculative).
- The 2-spawn hard cap as an *accuracy* claim (it's defensible as a budget claim; literature would support 6+ for accuracy-optimal).

If you have empirical data that addresses these, please open an issue.

## License

MIT