arXiv submission date: 2026-03-15
📄 Abstract - Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
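As a rough illustration of the two ideas named in the abstract, combinatorial prompt construction and programmatic filtering to a native script, the following Python sketch may help. The abstract does not give the details, so the slot names (`CHARACTERS`, `SETTINGS`, `MORALS`), the prompt template, the 0.9 threshold, and the choice of Devanagari as the example script are all hypothetical. This is not the authors' implementation.

```python
import itertools

# Hypothetical slot values; the paper's actual slots are not given in the abstract.
CHARACTERS = ["a curious rabbit", "a kind farmer"]
SETTINGS = ["a village", "a forest"]
MORALS = ["honesty", "sharing"]

# Unicode Devanagari block, used here only as an example target script.
DEVANAGARI = (0x0900, 0x097F)


def build_prompts(language):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    for character, setting, moral in itertools.product(CHARACTERS, SETTINGS, MORALS):
        yield (
            f"Write a short children's story in {language} about {character} "
            f"in {setting} that teaches {moral}."
        )


def native_script_ratio(text, start, end):
    """Fraction of letters whose code point lies in [start, end].

    Only characters for which str.isalpha() is true are counted, so digits,
    punctuation, whitespace, and combining vowel signs are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_range = sum(1 for ch in letters if start <= ord(ch) <= end)
    return in_range / len(letters)


def keep_story(text, script_range=DEVANAGARI, threshold=0.9):
    """Keep a story only if most of its letters are in the target script."""
    return native_script_ratio(text, *script_range) >= threshold
```

With two values per slot, `build_prompts("Hindi")` yields eight distinct prompts; adding values to any slot multiplies the number of prompts, which is the appeal of a combinatorial design. A filter like `keep_story` would discard outputs that fall back to Latin script, though a real pipeline would likely apply further checks.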

Top-level tags: llm data natural language processing
Detailed tags: synthetic dataset multilingual corpora low-resource languages small language models indic languages

Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models


1️⃣ One-sentence summary

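This paper builds a large synthetic dataset of more than 130,000 children's stories in 17 Indian languages, designed for training and evaluating small language models for low-resource languages, to address the scarcity of high-quality training data for those languages.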

Source: arXiv: 2603.14563