BOOM: Beyond Only One Modality KIT's Multimodal Multilingual Lecture Companion
1️⃣ One-Sentence Summary
This paper presents a system called BOOM that jointly translates a lecture's audio and slides, producing synchronized text, image, and speech outputs, with the aim of giving students worldwide a complete, accessible multilingual learning experience.
The globalization of education and rapid growth of online learning have made localizing educational content a critical challenge. Lecture materials are inherently multimodal, combining spoken audio with visual slides, which requires systems capable of processing multiple input modalities. To provide an accessible and complete learning experience, translations must preserve all modalities: text for reading, slides for visual understanding, and speech for auditory learning. We present \textbf{BOOM}, a multimodal multilingual lecture companion that jointly translates lecture audio and slides to produce synchronized outputs across three modalities: translated text, localized slides with preserved visual elements, and synthesized speech. This end-to-end approach enables students to access lectures in their native language while aiming to preserve the original content in its entirety. Our experiments demonstrate that slide-aware transcripts also yield cascading benefits for downstream tasks such as summarization and question answering. We release our Slide Translation code at this https URL and integrate it in Lecture Translator at this https URL.\footnote{All released code and models are licensed under the MIT License.}
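The abstract describes an end-to-end flow in which lecture audio and slides are translated jointly and emitted as three aligned modalities. The sketch below illustrates that pipeline shape only; every function and class name here is hypothetical (the paper does not specify this API), and the translation steps are placeholders standing in for real speech translation, slide localization, and TTS components.

```python
from dataclasses import dataclass

@dataclass
class TranslatedLecture:
    """Synchronized outputs across the three modalities described above."""
    transcript: str       # translated text for reading
    slides: list          # localized slide text (layout preservation omitted)
    speech: list          # per-segment strings standing in for synthesized audio

def translate_speech(audio_segments, target_lang):
    # Placeholder: a real system would run ASR + MT (or end-to-end
    # speech translation) over the lecture audio here.
    return [f"[{target_lang}] {seg}" for seg in audio_segments]

def translate_slides(slide_texts, target_lang):
    # Placeholder: a real system would extract slide text, translate it,
    # and re-render it into the original visual layout.
    return [f"[{target_lang}] {txt}" for txt in slide_texts]

def lecture_pipeline(audio_segments, slide_texts, target_lang="de"):
    """Produce all three modalities, keeping audio and slides aligned."""
    transcript_parts = translate_speech(audio_segments, target_lang)
    slides = translate_slides(slide_texts, target_lang)
    # Stand-in for feeding the translated transcript to a TTS engine.
    speech = transcript_parts
    return TranslatedLecture(" ".join(transcript_parts), slides, speech)

lecture = lecture_pipeline(["Welcome to the lecture."],
                           ["Slide 1: Introduction"])
print(lecture.transcript)
```

The point of the sketch is the joint, aligned structure: one call yields transcript, slides, and speech together, rather than three independent translation passes.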