TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
1️⃣ One-Sentence Summary
By building a unified suite of models that differ only in their tokenizer, together with a new benchmark, this paper reveals how tokenizer choice significantly affects language model performance and behavior, providing empirical grounding for understanding and selecting an appropriate tokenizer.
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical, sharing the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance under real-world perturbations that are likely to influence tokenization. Together, TokSuite enables robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
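To make the perturbation idea concrete, here is a minimal sketch (not the paper's tokenizer or benchmark) of a greedy longest-match subword tokenizer over a tiny hypothetical vocabulary. It shows how a small real-world perturbation, such as a typo, can fragment a word into many more tokens, which is the kind of tokenization shift the benchmark is designed to probe.

```python
# Toy greedy longest-match subword tokenizer. VOCAB is a made-up
# illustrative vocabulary, not derived from any real tokenizer.
VOCAB = {"language", "model", "lang", "uage",
         "l", "a", "n", "g", "u", "e", "m", "o", "d"}

def tokenize(word, vocab=VOCAB):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest piece first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character falls back to itself
            i += 1
    return tokens

print(tokenize("language"))  # -> ['language']         (one token)
print(tokenize("langauge"))  # -> ['lang', 'a', 'u', 'g', 'e']  (typo: five tokens)
```

Under this toy scheme, a single transposed character turns one token into five, so the model sees a very different input sequence for nearly identical text. Real subword tokenizers (BPE, WordPiece, Unigram) exhibit the same qualitative behavior, which is why perturbation robustness is a useful lens on tokenizer choice.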
Source: arXiv:2512.20757