Abstract - The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models
Recent advancements in Large Language Models (LLMs) have enabled sophisticated reasoning and content generation, yet their inherent stochasticity poses significant challenges for ensuring predictive credibility. While traditional uncertainty taxonomies, such as the dichotomy of aleatoric and epistemic uncertainty, provide conceptual foundations, they often fail to capture the multi-component and multi-stage nature of LLM generation and struggle to evaluate the effectiveness of various Uncertainty Quantification (UQ) methods. In this paper, we propose a granular uncertainty taxonomy that systematically attributes LLM uncertainty to input-level, parameter-level, token-level, and decoding-process sources. Correspondingly, we categorize existing UQ methods into Bayesian, ensemble, consensus-based, and single-pass approaches. Furthermore, we introduce a comprehensive evaluation framework covering diverse generation settings and metrics. We empirically evaluate 21 typical UQ methods across three prominent LLM families, namely Qwen3, Llama 3.2, and DeepSeek-V3, on benchmarks such as TriviaQA, GSM8K, and HumanEval. Our experimental results demonstrate that (i) the effectiveness of UQ methods is sensitive to task types and generation settings; (ii) consensus-based methods, such as Deg and EigV, consistently outperform other UQ approaches; and (iii) larger model scales correlate with lower uncertainty estimates, suggesting an empirical scaling law for LLM uncertainty. This work bridges the gap between theoretical origins and practical deployment, providing a versatile diagnostic tool for systematically quantifying uncertainty in LLM applications.
The Origins of Stochasticity: Comprehensive Investigations on Uncertainty Quantification for Large Language Models
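1️⃣ One-Sentence Summary
This paper proposes a finer-grained uncertainty taxonomy that decomposes LLM uncertainty into four sources (input, parameter, token, and decoding process), uses it to evaluate 21 mainstream quantification methods, and finds that consensus-based methods (such as Deg and EigV) perform best and that larger models show lower uncertainty, which amounts to an empirical regularity in how uncertainty changes with model scale.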
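The abstract names Deg and EigV as consensus-based methods but does not define them. As a rough illustration only, the sketch below follows one commonly cited graph-based formulation: sample several responses to the same prompt, build a pairwise similarity matrix, and score disagreement from its degree matrix and from the eigenvalues of its normalized Laplacian. The function name `consensus_scores`, the similarity matrix `W`, and the exact formulas are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np


def consensus_scores(W):
    """Return (deg_score, eigv_score) for a pairwise similarity matrix.

    Assumption: W is a symmetric (m x m) matrix with entries in [0, 1] and
    ones on the diagonal, where W[i][j] measures how similar sampled
    responses i and j are. How W is obtained (e.g. from an entailment
    model) is outside this sketch.
    """
    W = np.asarray(W, dtype=float)
    m = W.shape[0]
    degrees = W.sum(axis=1)

    # Degree-based score: 0 when every response agrees with every other,
    # growing as the average pairwise similarity drops.
    deg_score = float(np.sum(m - degrees) / m**2)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    laplacian = np.eye(m) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # Eigenvalue-based score: sum of max(0, 1 - lambda_k), which behaves
    # like a soft count of distinct answer clusters.
    eigenvalues = np.linalg.eigvalsh(laplacian)
    eigv_score = float(np.sum(np.maximum(0.0, 1.0 - eigenvalues)))

    return deg_score, eigv_score


if __name__ == "__main__":
    # Three sampled responses that all agree, then three that all differ.
    print(consensus_scores(np.ones((3, 3))))
    print(consensus_scores(np.eye(3)))
```

Under these assumptions, full agreement gives the lowest values of both scores, and complete disagreement pushes the eigenvalue-based score toward the number of sampled responses, so higher values read as higher uncertainty.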