arXiv submission date: 2026-02-25
📄 Abstract - Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment

Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus spanning a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model into a pool of six target languages, and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages on multiple tasks from the MTEB benchmark, evaluated for XLM-RoBERTa and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, fine-tuning the mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to fine-tuning on data without it, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
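The core training signal described above — pulling together embeddings of the same sentence across all languages in a multi-way parallel batch, while pushing apart different sentences — can be sketched as an InfoNCE-style contrastive loss. The implementation below is a minimal NumPy illustration of that idea, not the paper's actual code; the function name, temperature value, and batch layout are assumptions for illustration.

```python
import numpy as np

def multiway_infonce(embeddings, temperature=0.05):
    """InfoNCE-style contrastive loss over a batch of multi-way
    parallel sentences (illustrative sketch, not the paper's code).

    embeddings: array of shape (batch, n_langs, dim), where row b holds
    the same sentence b encoded in n_langs languages. Every cross-lingual
    pair of the same sentence is a positive; all other sentences in the
    batch serve as in-batch negatives.
    """
    b, k, d = embeddings.shape
    # L2-normalize so dot products are cosine similarities
    flat = embeddings.reshape(b * k, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T / temperature          # (b*k, b*k) similarities
    np.fill_diagonal(sim, -np.inf)             # exclude self-similarity

    # positives: same sentence index, any other language
    sent_id = np.repeat(np.arange(b), k)
    pos_mask = sent_id[:, None] == sent_id[None, :]
    np.fill_diagonal(pos_mask, False)

    # row-wise log-softmax; average negative log-prob of positives
    log_z = np.log(np.exp(sim).sum(axis=1, keepdims=True))
    log_prob = sim - log_z
    return -np.where(pos_mask, log_prob, 0.0).sum() / pos_mask.sum()
```

Under this formulation, well-aligned multilingual embeddings (translations of the same sentence close together in cosine space) yield a lower loss than unaligned ones, which is the property the contrastive training optimizes.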

Top-level tags: natural language processing, model training, machine learning
Detailed tags: multilingual embeddings, contrastive learning, cross-lingual alignment, parallel corpus, sentence representations

Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment


1️⃣ One-Sentence Summary

This paper shows that contrastive learning on a multi-way parallel corpus substantially improves a multilingual model's cross-lingual alignment, yielding better performance on a range of natural language understanding tasks than conventional bilingual (English-centric) parallel data.

Source: arXiv 2602.21543