arXiv submission date: 2026-03-26
📄 Abstract - Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties

Recent strategies for low-resource machine translation rely on LLMs to generate synthetic data from higher-resource languages. We find that this method fails for Romansh, because LLMs tend to confuse its 6 distinct language varieties. Our experiments show that instead, the direction of data augmentation should be aligned with the resource gradient between source and target language. This approach surpasses Gemini 3 Pro in the lowest-resource variety of Romansh by 23 BLEU. A human evaluation confirms that our experiments yield the first model that generates fluent translations in the individual Romansh varieties.

Top-level tags: llm, natural language processing, machine learning
Detailed tags: low-resource machine translation, data augmentation, synthetic data generation, language varieties, bleu evaluation

Translation Asymmetry in LLMs as a Data Augmentation Factor: A Case Study for 6 Romansh Language Varieties


1️⃣ One-Sentence Summary

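This paper finds that for a low-resource language with several varieties, such as Romansh, LLM-based data augmentation must choose its translation direction according to the resource gap between the source and target language, rather than simply generating synthetic data from a higher-resource language. With the direction chosen this way, the resulting model surpasses Gemini 3 Pro by 23 BLEU on the lowest-resource Romansh variety.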
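The direction choice described above can be illustrated with a minimal Python sketch. It rests on one reading of the abstract, which is an assumption rather than the authors' confirmed pipeline: the LLM reads authentic text in the lower-resource variety and writes the higher-resource language, so the low-resource side of every training pair is never LLM-generated. The names `Language`, `resource_score`, `choose_llm_direction`, `build_synthetic_pairs`, and the `translate` callable are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Language:
    code: str              # placeholder label, not a code used in the paper
    resource_score: float  # hypothetical proxy for how much data exists


def choose_llm_direction(a: Language, b: Language) -> Tuple[Language, Language]:
    """Return (llm_source, llm_target).

    Assumption: the LLM should read the lower-resource language and
    write the higher-resource one, never the other way around.
    """
    return (a, b) if a.resource_score <= b.resource_score else (b, a)


def build_synthetic_pairs(
    authentic_sentences: List[str],
    lang_a: Language,
    lang_b: Language,
    translate: Callable[[str, str, str], str],
) -> List[Dict[str, str]]:
    """Pair each authentic low-resource sentence with an LLM translation.

    `translate(text, src_code, tgt_code)` stands in for any LLM call;
    it is a hypothetical interface, not a real library API.
    """
    src, tgt = choose_llm_direction(lang_a, lang_b)
    pairs = []
    for sentence in authentic_sentences:
        synthetic = translate(sentence, src.code, tgt.code)
        # The authentic sentence stays on the low-resource side;
        # only the high-resource side is synthetic.
        pairs.append({src.code: sentence, tgt.code: synthetic})
    return pairs
```

Under this assumption, the resulting pairs can train a model in either direction, and the Romansh side always remains human-written text in a single variety.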

Source: arXiv: 2603.25489