📄 Abstract - OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and improves overall translation quality (code is available at this https URL).
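
The fusion strategy described above (connecting hidden states from multiple layers of a pretrained MMFM to a translation LLM) can be sketched roughly as follows. This is a minimal illustration only: the tapped layer indices, the per-layer linear adapters, and the learned weighted-sum combination are assumptions for the sketch, not the authors' exact design; the abstract only states that multi-layer hidden states are connected to the translation LLM and trained end-to-end.

```python
import torch
import torch.nn as nn


class MultiLayerFusion(nn.Module):
    """Sketch: project hidden states tapped from several MMFM layers into the
    translation LLM's embedding space and mix them with learned weights."""

    def __init__(self, mmfm_dim: int, llm_dim: int, fused_layers=(8, 16, 24)):
        super().__init__()
        self.fused_layers = fused_layers  # which MMFM layers to tap (assumed)
        # One linear adapter per tapped layer, mapping into the LLM embedding space.
        self.proj = nn.ModuleList(nn.Linear(mmfm_dim, llm_dim) for _ in fused_layers)
        # Learned scalar mixing weights over the tapped layers.
        self.layer_weights = nn.Parameter(torch.zeros(len(fused_layers)))

    def forward(self, mmfm_hidden_states):
        # mmfm_hidden_states: list of per-layer tensors, each (batch, seq, mmfm_dim),
        # e.g. the hidden_states output of a (possibly frozen) MMFM encoder.
        projected = torch.stack(
            [p(mmfm_hidden_states[i]) for p, i in zip(self.proj, self.fused_layers)],
            dim=0,
        )  # (num_tapped_layers, batch, seq, llm_dim)
        weights = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        # Weighted sum over tapped layers -> embeddings to feed the translation LLM.
        return (weights * projected).sum(dim=0)
```

In a full system, the fused embeddings would then be injected into the translation LLM (for example, prepended to its token embeddings as a soft prefix) and the adapters trained jointly with the translation objective, which is what allows a single end-to-end model to handle speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation.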

Top-level tags: llm multi-modal natural language processing
Detailed tags: multimodal translation simultaneous translation model fusion speech translation multilingual

OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion


1️⃣ One-sentence summary

This paper proposes a new model called OmniFusion, which uses a novel fusion method to combine a strong multimodal foundation model with a dedicated multilingual translation LLM, so that it can translate directly from speech and image inputs in real time with high quality, faster and better than traditional cascaded pipelines.

