Models Recall What They Violate: Constraint Adherence in Multi-Turn LLM Ideation
1️⃣ One-Sentence Summary
By constructing the DriftBench benchmark, this paper finds that large language models gradually drift from their original constraints during multi-turn scientific ideation, and reveals a key contradiction: models can accurately recall the constraints yet frequently violate them in their actual generations. This "knows-but-violates" phenomenon is pervasive across models and conditions.
When researchers iteratively refine ideas with large language models, do the models preserve fidelity to the original objective? We introduce DriftBench, a benchmark for evaluating constraint adherence in multi-turn LLM-assisted scientific ideation. Across 2,146 scored benchmark runs spanning seven models from five providers (including two open-weight), four interaction conditions, and 38 research briefs from 24 scientific domains, we find that iterative pressure reliably increases structural complexity and often reduces adherence to original constraints. A restatement probe reveals a dissociation between declarative recall and behavioral adherence, as models accurately restate constraints they simultaneously violate. The knows-but-violates (KBV) rate, measuring constraint non-compliance despite preserved recall, ranges from 8% to 99% across models. Structured checkpointing partially reduces KBV rates but does not close the dissociation, and complexity inflation persists. Human validation against blind raters confirms that the LLM judge under-detects constraint violations, making reported constraint adherence scores conservative. Sensitivity analyses confirm the findings are robust to temperature (0.7 vs. 1.0) and pressure type (novelty vs. rigor). We release all briefs, prompts, rubrics, transcripts, and scores as an open benchmark.
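The KBV rate described above can be operationalized as a simple conditional fraction. The sketch below is a hypothetical illustration, assuming each scored turn carries a boolean recall flag (did the model restate the constraint correctly?) and a violation flag (did its generated idea break that constraint?); the field names and data shape are illustrative, not taken from the paper's released artifacts.

```python
# Hedged sketch of the knows-but-violates (KBV) rate: among turns where
# constraint recall is preserved, the fraction whose generation nonetheless
# violates that constraint. Field names are assumed for illustration.

def kbv_rate(turns):
    """turns: list of dicts with boolean 'recalled' and 'violated' flags."""
    eligible = [t for t in turns if t["recalled"]]  # preserved recall only
    if not eligible:
        return 0.0
    return sum(t["violated"] for t in eligible) / len(eligible)

turns = [
    {"recalled": True,  "violated": True},
    {"recalled": True,  "violated": False},
    {"recalled": False, "violated": True},   # not KBV: recall was lost
    {"recalled": True,  "violated": True},
]
print(kbv_rate(turns))  # 2 of the 3 recalled turns violate, so ~0.667
```

Conditioning on preserved recall is what separates KBV from plain non-compliance: a violation after recall failure is ordinary forgetting, whereas a violation despite accurate restatement is the dissociation the paper measures.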
Source: arXiv:2604.28031