arXiv submission date: 2026-04-30

📄 Abstract - On the Proper Treatment of Units in Surprisal Theory

Surprisal theory links human processing effort to the predictability of an upcoming linguistic unit, but empirical work often leaves the notion of a unit underspecified. In practice, experimental stimuli are segmented into linguistically motivated units (e.g., words), while pretrained language models assign probability mass to a fixed token alphabet that typically does not align with those units. As a result, surprisal-based predictors depend implicitly on ad hoc procedures that conflate two distinct modeling choices: the definition of the unit of analysis and the choice of regions of interest over which predictions are evaluated. In this paper, we disentangle these choices and give a unified framework for reasoning about surprisal over arbitrary unit inventories. We argue that surprisal-based analyses should make these choices explicit and treat tokenization as an implementation detail rather than a scientific primitive.
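The mismatch between a model's tokens and the experimenter's units can be made concrete with a small sketch. It shows one commonly used procedure of the kind the abstract calls ad hoc: summing subword surprisals into word-level surprisals. This is not necessarily the framework the paper proposes. The function name `word_surprisals`, the leading-space boundary convention, and the example probabilities are all illustrative assumptions, not taken from the paper.

```python
import math


def word_surprisals(tokens, token_logprobs, boundary=" "):
    """Sum per-token surprisals (in bits) into per-word surprisals.

    Assumption (hypothetical convention): a token that starts with
    `boundary` opens a new word; every other token continues the
    current word. `token_logprobs` are natural-log probabilities.
    """
    if len(tokens) != len(token_logprobs):
        raise ValueError("tokens and token_logprobs must have equal length")

    words, surprisals = [], []
    for tok, logprob in zip(tokens, token_logprobs):
        bits = -logprob / math.log(2)  # convert nats to bits
        starts_word = tok.startswith(boundary)
        if starts_word or not words:
            # Open a new word, dropping the boundary marker if present.
            words.append(tok[len(boundary):] if starts_word else tok)
            surprisals.append(bits)
        else:
            # Chain rule: surprisal of a word is the sum over its tokens.
            words[-1] += tok
            surprisals[-1] += bits
    return list(zip(words, surprisals))


# Made-up probabilities, for illustration only.
example = word_surprisals(
    [" The", " un", "believ", "able"],
    [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.5)],
)
```

Note that the boundary convention fixes the unit of analysis implicitly: changing `boundary`, or deciding which token carries the whitespace, changes the resulting word-level values. That hidden dependence is the kind of choice the abstract argues should be stated explicitly.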

Top-level tags: natural language processing, llm

On the Proper Treatment of Units in Surprisal Theory

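1️⃣ One-sentence summary

This paper highlights an overlooked problem in surprisal theory research: researchers typically define linguistic units by differing standards (e.g., words versus subwords), which makes experimental results unreliable. It proposes a unified framework that makes both the unit of analysis and the regions of interest used for evaluation explicit, so that surprisal-based predictions become more principled and reproducible.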

Source: arXiv: 2604.28147
