Samasāmayik: A Parallel Dataset for Hindi-Sanskrit Machine Translation
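1️⃣ One-sentence summary
This paper releases Samasāmayik, a new, large-scale Hindi-Sanskrit parallel dataset focused on contemporary content, and shows experimentally that it significantly improves machine translation models on in-domain data, providing a valuable new resource for low-resource Indic language translation.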
We release Samasāmayik, a novel, meticulously curated, large-scale Hindi-Sanskrit corpus comprising 92,196 parallel sentences. Unlike most available Sanskrit data, which focuses on classical-era text and poetry, this corpus aggregates data from diverse sources covering contemporary materials, including spoken tutorials, children's magazines, radio conversations, and instructional materials. We benchmark the new dataset by fine-tuning three complementary models (ByT5, NLLB, and IndicTrans-v2) to demonstrate its utility. Our experiments show that models trained on the Samasāmayik corpus achieve significant performance gains on in-domain test data while achieving comparable performance on other widely used test sets, establishing a strong new baseline for contemporary Hindi-Sanskrit translation. Furthermore, a comparative analysis against existing corpora reveals minimal semantic and lexical overlap, confirming the novelty and non-redundancy of our dataset as a robust new resource for low-resource Indic language MT.
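The abstract does not say how lexical overlap between corpora was measured. As a rough illustration only, one common way to quantify it is Jaccard similarity over word n-gram sets. The sketch below assumes whitespace tokenization and uses hypothetical function names (`word_ngrams`, `jaccard_overlap`); it is not the authors' method.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(sentences: Iterable[str], n: int = 1) -> Set[Tuple[str, ...]]:
    """Collect the set of word n-grams across all sentences.

    Assumption: tokens are separated by whitespace. A real pipeline for
    Hindi or Sanskrit would likely need script-aware tokenization.
    """
    grams: Set[Tuple[str, ...]] = set()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return grams


def jaccard_overlap(corpus_a: Iterable[str], corpus_b: Iterable[str], n: int = 1) -> float:
    """Jaccard similarity between the n-gram sets of two corpora.

    Returns 0.0 when both corpora yield no n-grams.
    """
    a = word_ngrams(corpus_a, n)
    b = word_ngrams(corpus_b, n)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Placeholder token strings, not real corpus data.
    new_corpus = ["a b c"]
    existing_corpus = ["b c d"]
    print(jaccard_overlap(new_corpus, existing_corpus, n=1))
    print(jaccard_overlap(new_corpus, existing_corpus, n=2))
```

A score near 0 would indicate that two corpora share little vocabulary or phrasing, which is the kind of evidence a non-redundancy claim rests on. Semantic overlap would need a different measure, such as similarity between sentence embeddings.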
Source: arXiv: 2603.24307