More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification
1️⃣ One-Sentence Summary
This study finds that, when large language models perform content verification, letting the reviewer engage in multi-turn Q&A with the author actually lowers overall accuracy: the additional review rounds introduce large numbers of false positives and shift the review's focus from checking the original artifact to judging the conversation itself.
Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings once the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from the amount of information -- within the multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
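The reported numbers illustrate why a modest recall gain cannot compensate for a precision collapse: F1 is the harmonic mean of precision and recall, which is dominated by the smaller of the two. A minimal sketch below plugs in the abstract's precision figures (0.30 and 0.20); the recall values are assumptions chosen to be consistent with the reported F1 scores and the stated +0.08 recall gain, not values stated directly in the paper.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Single-pass CCR: precision ~0.30 (from the abstract); recall ~0.50
# is an assumed value consistent with the reported F1 = 0.376.
print(round(f1(0.30, 0.50), 3))  # 0.375

# Multi-turn D-CCR-2b: precision collapses to ~0.20 while recall
# rises by the reported +0.08 to ~0.58; the precision drop dominates.
print(round(f1(0.20, 0.58), 3))  # 0.297
```

Because the harmonic mean penalizes imbalance, dropping precision by a third outweighs the recall improvement, matching the reported fall from F1 = 0.376 to roughly 0.30.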
Source: arXiv: 2603.16244