arXiv submission date: 2026-04-30
📄 Abstract - On the Proper Treatment of Units in Surprisal Theory

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
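To make the mismatch between tokens and units concrete, here is a minimal sketch of one common convention: summing subword-token surprisals to obtain a word-level surprisal, which the chain rule justifies when a word's probability is the product of its tokens' conditional probabilities. It is not the paper's framework. The function names, the toy tokens, and the probabilities are all hypothetical, and the token-to-word alignment is supplied by hand, which is exactly the kind of choice the abstract says should be made explicit.

```python
import math


def token_surprisals(token_probs):
    """Convert conditional token probabilities into surprisals in bits."""
    return [-math.log2(p) for p in token_probs]


def unit_surprisals(token_probs, unit_lengths):
    """Sum token surprisals within each unit.

    token_probs  -- p(token_i | preceding tokens), one value per token
    unit_lengths -- number of consecutive tokens that make up each unit
    """
    if sum(unit_lengths) != len(token_probs):
        raise ValueError("unit_lengths must cover every token exactly once")
    surprisals = token_surprisals(token_probs)
    result = []
    start = 0
    for length in unit_lengths:
        # Chain rule: -log p(unit) is the sum of -log p(token | context).
        result.append(sum(surprisals[start:start + length]))
        start += length
    return result


# Hypothetical toy data, not the output of any real language model.
tokens = ["The", " un", "believ", "able", " story"]
token_probs = [0.5, 0.25, 0.5, 0.5, 0.125]
unit_lengths = [1, 3, 1]  # "The" | " unbelievable" | " story"

word_surprisal = unit_surprisals(token_probs, unit_lengths)
```

Changing `unit_lengths` changes the unit of analysis without touching the model's token probabilities, which illustrates why the two choices can be kept separate.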

Top-level tags: natural language processing, llm
Detailed tags: surprisal theory, tokenization, unit of analysis, psycholinguistics, language models

On the Proper Treatment of Units in Surprisal Theory


1️⃣ One-sentence summary

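This paper highlights an overlooked problem in surprisal-theory research: experimental stimuli are segmented into linguistic units such as words, while language models assign probability to subword tokens that typically do not align with those units, so surprisal predictors end up depending on ad hoc procedures; the paper proposes a unified framework that makes the unit of analysis and the regions of interest explicit, with the aim of making surprisal-based predictions more principled and reproducible.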

Source: arXiv: 2604.28147