
arXiv submission date: 2026-03-17
📄 Abstract - More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from the amount of information -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
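The reported precision/recall trade-off can be sanity-checked with the standard F1 formula (harmonic mean of precision and recall). A minimal sketch; the recall values below are approximations back-solved from the reported F1 and precision scores, not figures taken from the paper:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Single-pass CCR: reported precision ~0.30; recall ~0.50 is an
# approximation back-solved from the reported F1 = 0.376.
print(round(f1(0.30, 0.50), 3))  # ~0.375, close to the reported 0.376

# Multi-turn D-CCR-2b: precision collapses to ~0.20 while recall
# rises by only ~0.08 (to ~0.58); the net F1 drop follows directly.
print(round(f1(0.20, 0.58), 3))  # ~0.297, close to the reported 0.303
```

Because F1 is a harmonic mean, it is dominated by the smaller of the two terms, which is why the modest recall gain cannot offset the precision collapse.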

Top-level tags: llm model evaluation natural language processing
Detailed tags: cross-context review verification multi-turn interaction false positives benchmark

More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification


1️⃣ One-Sentence Summary

This study finds that when large language models verify content, letting the reviewer engage in multi-turn Q&A with the author actually lowers overall accuracy: the extra review rounds introduce a large number of false positives and shift the review's focus from checking the original artifact to judging the conversation itself.

Source: arXiv 2603.16244