Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
1️⃣ One-Sentence Summary
This paper proposes a new method that trains a probabilistic circuit to detect anomalies (i.e., hallucinations) in a large language model's internal states, and applies dynamic correction only when a hallucination is detected, thereby avoiding damage to text that was already generated correctly. Across multiple benchmarks it achieves near-perfect hallucination detection and a lower text-corruption rate.
One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., to produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, also corrupting originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B parameters, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection on CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA for three of the four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
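The core gating idea can be illustrated with a minimal sketch. Note the heavy assumptions: a diagonal Gaussian stands in for the paper's probabilistic circuit as the tractable density estimator, and the `DensityGate` class, the 95%-quantile threshold, and the contrastive-correction formula are all illustrative choices, not the authors' PCNET/PC-LDCD implementation. The sketch only shows the mechanism: flag a hidden state whose exact NLL is anomalously high, and intervene on the logits only then.

```python
import numpy as np

class DensityGate:
    """NLL-based anomaly gate over hidden states.

    A diagonal Gaussian is a hypothetical stand-in for the paper's
    probabilistic circuit; both admit exact NLL computation.
    """

    def __init__(self, factual_states: np.ndarray, threshold_quantile: float = 0.95):
        # Fit the density to hidden states collected from factual generations.
        self.mu = factual_states.mean(axis=0)
        self.var = factual_states.var(axis=0) + 1e-6  # avoid zero variance
        nlls = np.array([self.nll(h) for h in factual_states])
        # States whose NLL exceeds this quantile are treated as anomalous.
        self.threshold = float(np.quantile(nlls, threshold_quantile))

    def nll(self, h: np.ndarray) -> float:
        # Exact negative log-likelihood under the diagonal Gaussian.
        return float(0.5 * np.sum((h - self.mu) ** 2 / self.var
                                  + np.log(2.0 * np.pi * self.var)))

    def is_hallucination(self, h: np.ndarray) -> bool:
        return self.nll(h) > self.threshold


def decode_step(h: np.ndarray, gate: DensityGate,
                base_logits: np.ndarray, contrast_logits: np.ndarray,
                alpha: float = 1.0) -> np.ndarray:
    """Apply a contrastive logit correction only when the gate fires."""
    if gate.is_hallucination(h):
        # Illustrative contrastive adjustment, pushing away from the
        # contrast distribution; the exact PC-LDCD form may differ.
        return base_logits + alpha * (base_logits - contrast_logits)
    return base_logits  # factual states are left untouched
```

In a real decoding loop, `h` would be the residual-stream activation at the current step and the gate would decide, token by token, whether the correction is applied, which is what lets correct generations pass through unchanged.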
From arXiv: 2605.05953