arXiv submission date: 2025-12-17
📄 Abstract - Bolmo: Byteifying the Next Generation of Language Models

We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization - such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary - while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.
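The abstract credits much of the cheap conversion to an exact distillation objective between Bolmo and its source subword model, but does not spell that objective out here. As a rough point of reference only, the sketch below shows the standard temperature-scaled KL distillation loss that such objectives typically build on; aligning a byte-level student with a subword-level teacher (the hard part the paper addresses) is assumed away by treating both sets of logits as already defined over a shared prediction space. The function name and tensor shapes are illustrative, not taken from the paper.

```python
# Generic knowledge-distillation sketch in PyTorch. This is NOT the paper's
# exact distillation objective; it only illustrates the KL-based form that
# teacher-student distillation commonly uses.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Temperature-scaled KL(teacher || student), averaged over all positions."""
    t = temperature
    # Flatten (batch, seq, vocab) -> (batch * seq, vocab) so "batchmean"
    # averages over every prediction position.
    log_p_student = F.log_softmax(student_logits / t, dim=-1).flatten(0, -2)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).flatten(0, -2)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * (t ** 2)

# Example usage (shapes are illustrative, assuming an aligned prediction space):
#   student_logits: (batch, seq, vocab) from the byte-level student
#   teacher_logits: (batch, seq, vocab) from the frozen subword teacher
#   loss = distillation_loss(student_logits, teacher_logits.detach(), temperature=2.0)
```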

Top tags: llm model training natural language processing
Detailed tags: byte-level language models tokenization knowledge distillation model compression character understanding

Bolmo: Byteifying the Next Generation of Language Models


1️⃣ One-Sentence Summary

This paper proposes a new method called Bolmo, which uses an efficient "byteification" technique to convert existing subword-based language models into byte-level models, resolving the limitations of traditional subword models in character understanding and efficiency while maintaining high performance, at a very low conversion cost.

Source: arXiv: 2512.15586