arXiv submission date: 2026-04-27
📄 Abstract - Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
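To put the three released checkpoint formats in perspective, here is a rough back-of-the-envelope sketch of weight storage at each precision. It assumes that "30B" in the backbone name means roughly 30 billion total parameters and that every weight is stored at the nominal bit width. Both are assumptions, not figures stated in the abstract. Real checkpoints will differ because of quantization scale factors, layers kept at higher precision, and the added multimodal encoders.

```python
# Rough weight-only size estimate per checkpoint format.
# Ignores quantization scales, mixed-precision layers, activations, and KV cache.
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4}

# Assumption: "30B" in "30B-A3B" denotes ~30 billion total parameters.
TOTAL_PARAMS = 30e9


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Return approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


estimates = {
    fmt: weight_gigabytes(TOTAL_PARAMS, bits)
    for fmt, bits in BITS_PER_PARAM.items()
}

for fmt, gb in estimates.items():
    print(f"{fmt}: ~{gb:.0f} GB")
```

Under these assumptions the estimate halves with each step down in precision, which is the main practical reason to offer FP8 and FP4 alongside BF16.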

Top-level tags: multi-modal, model training, model evaluation
Detailed tags: audio input, multimodal understanding, inference efficiency, open-source

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence


1️⃣ One-sentence summary

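This paper introduces Nemotron 3 Nano Omni, which for the first time adds native audio input on top of text, image, and video support; through architectural advances and improved training data it achieves leading results on document understanding, long audio-video comprehension, and agentic computer use, and by combining the efficient 30B-A3B backbone with multimodal token-reduction techniques it substantially lowers inference latency and raises throughput, while the authors openly release model weights in several precisions along with portions of the training data and code.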

Source: arXiv: 2604.24954