arXiv submission date: 2026-03-09
📄 Abstract - TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at this http URL. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
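The abstract's core idea is a curriculum that alternates between a uniform per-language sampling distribution (giving low-resource languages equal weight) and the natural distribution (proportional to corpus size). A minimal sketch of that scheduling logic, assuming hypothetical function names and a toy corpus-size mapping (the paper's actual schedule and proportions are not specified here):

```python
def language_sampling_weights(corpus_sizes, phase):
    """Return per-language sampling probabilities for one training phase.

    corpus_sizes: dict mapping language code -> number of tokens (toy values).
    phase: "uniform" gives every language equal probability;
           "natural" weights each language by its share of the corpus.
    """
    langs = list(corpus_sizes)
    if phase == "uniform":
        # Equal weight per language, regardless of corpus size.
        return {lang: 1.0 / len(langs) for lang in langs}
    # Natural distribution: probability proportional to corpus size.
    total = sum(corpus_sizes.values())
    return {lang: corpus_sizes[lang] / total for lang in langs}


def curriculum_schedule(corpus_sizes, phases):
    """Build the alternating curriculum: one weight table per phase."""
    return [language_sampling_weights(corpus_sizes, p) for p in phases]


# Toy example: one high-resource and two low-resource languages.
sizes = {"en": 900, "lv": 50, "et": 50}
schedule = curriculum_schedule(sizes, ["natural", "uniform", "natural"])
```

Under the uniform phase each of the three toy languages is sampled with probability 1/3, while the natural phase samples "en" with probability 0.9; alternating the two is what the paper credits for balancing equity against raw data volume.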

Top tags: llm natural language processing model training
Detailed tags: multilingual llm curriculum learning low-resource languages data imbalance model evaluation

TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation


1️⃣ One-sentence summary

This paper presents TildeOpen LLM, a 30-billion-parameter open-weight large language model that combines dataset upsampling with a curriculum-learning training schedule to substantially improve coverage of 34 European languages, especially low-resource ones, achieving more equitable multilingual performance while using fewer computing resources.

Source: arXiv 2603.08182