Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

📄 Abstract - Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

Diffusion language models offer a promising alternative to autoregressive models due to their potential for parallel and iterative generation. However, existing approaches use a single network for both context representation and iterative denoising, forcing one model to serve both roles and limiting its capacity for either role. We propose TwoTower, a block-wise autoregressive diffusion model that decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Built on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer MoE model, and trained on approximately 2.1T tokens, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's quality while offering 2.42X higher wall-clock generation throughput. We release the code and model weights at this https URL.
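The two attention patterns named in the abstract can be pictured as boolean masks. The following is only a minimal sketch under stated assumptions, not the released implementation: the function names are hypothetical, the block size is arbitrary, and the rule that a noisy block cross-attends only to clean tokens from earlier blocks is an assumption inferred from the "block-wise autoregressive" description.

```python
# Illustrative attention masks for a two-tower, block-wise setup.
# All names and the exact visibility rules are assumptions for illustration;
# they are not taken from the released Nemotron-TwoTower code.

def causal_mask(n):
    """Context tower: clean token i may attend to clean tokens j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser self-attention: noisy positions attend in both directions,
    but only to positions inside their own block."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


def cross_attention_mask(n, block_size):
    """Denoiser-to-context cross-attention (assumed rule): a noisy position
    in block b sees only clean context tokens from blocks before b."""
    return [[j < (i // block_size) * block_size for j in range(n)]
            for i in range(n)]


def show(mask):
    """Render a mask as rows of 1/0 for quick inspection."""
    return "\n".join("".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)), end="\n\n")
    print(show(block_bidirectional_mask(4, 2)), end="\n\n")
    print(show(cross_attention_mask(4, 2)))
```

Under these assumptions, the first block has no context to cross-attend to, and every later block sees all clean tokens that precede it while denoising its own positions jointly.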


1️⃣ One-Sentence Summary

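This paper proposes TwoTower, a two-tower diffusion language model that decouples context understanding and iterative denoising into two separate modules, retaining nearly 99% of the pretrained autoregressive model's generation quality while making text generation about 2.4 times faster and enabling more efficient parallel generation.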
