arXiv submission date: 2026-03-23
📄 Abstract - Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
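The multi-metric filtering step described above can be sketched as a simple conjunction of thresholds over precomputed scores. This is a minimal illustration, not the paper's implementation: the threshold values and the `ScoredPair` container are assumptions, and the sacreBLEU, COMET, and LaBSE scores are taken as already computed (in practice they would come from the `sacrebleu`, `unbabel-comet`, and `sentence-transformers` libraries respectively).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ScoredPair:
    """A synthetic document-level pair with its precomputed quality scores."""
    source: str
    translation: str
    bleu: float       # sacreBLEU score, 0-100
    comet: float      # COMET quality estimate, roughly 0-1
    labse_cos: float  # LaBSE embedding cosine similarity, -1 to 1

def filter_pairs(
    pairs: List[ScoredPair],
    bleu_min: float = 15.0,    # illustrative thresholds, not the paper's values
    comet_min: float = 0.70,
    labse_min: float = 0.80,
) -> List[ScoredPair]:
    """Keep only pairs that pass every metric threshold."""
    return [
        p for p in pairs
        if p.bleu >= bleu_min and p.comet >= comet_min and p.labse_cos >= labse_min
    ]

# Usage: one pair passes all thresholds, the other fails on BLEU.
kept = filter_pairs([
    ScoredPair("src A", "tgt A", bleu=22.0, comet=0.81, labse_cos=0.91),
    ScoredPair("src B", "tgt B", bleu=4.0, comet=0.82, labse_cos=0.90),
])
```

Requiring all metrics to pass (rather than averaging them) means a pair with one very weak signal, such as a likely omission flagged by low LaBSE similarity, is discarded even if the other scores look fine.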

Top-level tags: llm, natural language processing, model training
Detailed tags: machine translation, document-level translation, data augmentation, fine-tuning, synthetic data

Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation


1️⃣ One-sentence summary

This paper proposes a two-stage fine-tuning approach that uses a large language model to generate, and then filter, high-quality document-level translation data, addressing the data scarcity and the hallucination/omission tendencies of LLMs in document translation and thereby improving their translation quality.
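The two adaptation stages in the summary above reduce to running the same supervised fine-tuning routine twice in a fixed order. The sketch below only captures that sequencing; the `train` callback is a hypothetical stand-in for an actual fine-tuning step (e.g. one pass of supervised training), not an API from the paper.

```python
from typing import Callable, TypeVar

M = TypeVar("M")  # whatever type represents the model/checkpoint
D = TypeVar("D")  # whatever type represents a training corpus

def two_stage_finetune(
    model: M,
    sentence_corpus: D,
    document_corpus: D,
    train: Callable[[M, D], M],
) -> M:
    """Stage 1: adapt on abundant sentence-level MT data.
    Stage 2: continue on the filtered synthetic document-level corpus."""
    model = train(model, sentence_corpus)
    model = train(model, document_corpus)
    return model
```

Ordering matters here: the sentence-level stage gives the model broad translation competence from plentiful data, and the smaller filtered document-level stage then specializes it for cross-sentence coherence.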

Source: arXiv:2603.22186