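On the Proper Treatment of Units in Surprisal Theory
1️⃣ One-sentence summary
This paper exposes an overlooked problem in surprisal theory research: researchers define linguistic units by differing criteria (for example, words versus subwords), which makes experimental results unreliable. It proposes a unified framework that makes the unit of analysis and the regions of evaluation explicit, so that surprisal-based predictions become more rigorous and reproducible.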
Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
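To make the distinction between the two modeling choices concrete, the sketch below shows one common ad hoc procedure: summing subword-token surprisals over the tokens that spell out each word, and then, as a separate step, selecting the regions of interest to evaluate. This is not the paper's framework. The tokens, probabilities, spans, and the helper names `token_surprisals` and `unit_surprisals` are all hypothetical, and a sum of token surprisals may not equal the true unit-level surprisal when token boundaries do not align cleanly with units.

```python
import math

def token_surprisals(logprobs):
    """Convert natural-log token probabilities to surprisal in bits."""
    return [-lp / math.log(2) for lp in logprobs]

def unit_surprisals(token_surps, unit_spans):
    """Sum token surprisals over each unit's [start, end) token span.

    Choice 1 (unit of analysis): `unit_spans` defines what counts as a unit.
    """
    return [sum(token_surps[start:end]) for start, end in unit_spans]

# Hypothetical model output: tokens and their conditional probabilities.
tokens = ["The", " sur", "pris", "al", " rose"]
probs = [0.5, 0.25, 0.5, 0.5, 0.125]
logprobs = [math.log(p) for p in probs]

# Hypothetical word-level units: "The", "surprisal", "rose".
word_spans = [(0, 1), (1, 4), (4, 5)]

surps = token_surprisals(logprobs)
words = unit_surprisals(surps, word_spans)

# Choice 2 (regions of interest): which units are actually evaluated.
# Here only the second word is treated as a critical region (an assumption).
roi_indices = [1]
roi_surprisals = [words[i] for i in roi_indices]

print(words)
print(roi_surprisals)
```

Keeping `word_spans` and `roi_indices` as separate inputs mirrors the abstract's point: changing how units are defined and changing where predictions are evaluated are independent decisions, and each should be reported explicitly.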
Source: arXiv: 2604.28147