arXiv submission date: 2026-01-04
📄 Abstract - MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
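
To make the SATS task concrete, the following is a minimal sketch of what speaker-attributed, time-stamped output could look like as a data structure. The `Segment` class, the speaker labels (`S1`, `S2`), and the `[mm:ss.mmm - mm:ss.mmm]` rendering are illustrative assumptions; the abstract does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One speaker turn: who spoke, when, and what was said (hypothetical schema)."""
    speaker: str   # assumed speaker label, e.g. "S1"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words for this turn


def format_timestamp(seconds: float) -> str:
    """Render seconds as mm:ss.mmm; minutes are not wrapped into hours."""
    total_ms = round(seconds * 1000)
    minutes, ms = divmod(total_ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_transcript(segments):
    """Sort segments by start time and render one line per speaker turn."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)} - {format_timestamp(s.end)}] {s.speaker}: {s.text}"
        for s in ordered
    )


if __name__ == "__main__":
    demo = [
        Segment("S2", 3.2, 5.0, "Sure."),
        Segment("S1", 0.0, 3.0, "Shall we start?"),
    ]
    print(render_transcript(demo))
```

Each line of the rendered transcript answers the three questions SATS combines: who (speaker label), when (start and end times), and what (the text).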

Top-level tags: natural language processing, audio, multi-modal
Detailed tags: speaker diarization, speech transcription, multimodal LLM, end-to-end, meeting transcription

MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization


1️⃣ One-sentence summary

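This paper presents MOSS Transcribe Diarize, an end-to-end multimodal large language model that jointly and accurately identifies who said what, and when, in settings such as meetings, and that outperforms state-of-the-art commercial systems on multiple benchmarks.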

Source: arXiv: 2601.01554