arXiv submission date: 2026-01-15
📄 Abstract - LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning

Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
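To make the alignment idea concrete, here is a minimal, hypothetical sketch of what aligning latent visual thoughts could look like as a training objective: a KL term matching the student's per-step attention over image patches to the teacher's, plus a cosine term on the per-step latent visual embeddings. Function names, tensor shapes, and the specific loss form are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of an attention-trajectory alignment loss (not the paper's code).
import torch
import torch.nn.functional as F

def attention_alignment_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) between attention distributions over image patches.

    student_attn, teacher_attn: (batch, steps, num_patches) non-negative weights,
    one attention map per latent reasoning step.
    """
    s = student_attn / (student_attn.sum(-1, keepdim=True) + eps)
    t = teacher_attn / (teacher_attn.sum(-1, keepdim=True) + eps)
    return (t * (torch.log(t + eps) - torch.log(s + eps))).sum(-1).mean()

def latent_visual_loss(student_latents, teacher_latents):
    """Cosine-distance alignment of per-step latent visual thoughts, shape (batch, steps, dim)."""
    return (1 - F.cosine_similarity(student_latents, teacher_latents, dim=-1)).mean()

# Example usage with random tensors standing in for model outputs.
B, S, P, D = 2, 4, 196, 256
loss = attention_alignment_loss(torch.rand(B, S, P), torch.rand(B, S, P)) \
     + latent_visual_loss(torch.randn(B, S, D), torch.randn(B, S, D))
```

In this sketch the two terms are simply summed; how the actual method weights attention alignment against semantic reconstruction is not specified by the abstract.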

Top-level tags: multi-modal model training natural language processing
Detailed tags: visual reasoning attention alignment knowledge distillation multimodal grounding curriculum learning

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning


1️⃣ One-Sentence Summary

This paper proposes a new framework called LaViT, which has the student model learn and reproduce the teacher model's visual attention trajectories and semantic understanding before generating text. This addresses the problem in multimodal reasoning where models rely on language priors while neglecting genuine visual perception, substantially improving visual grounding and enabling even small models to perform strongly on complex reasoning tasks.
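The abstract also mentions a curriculum sensory gating mechanism used to prevent shortcut learning. Below is a hypothetical sketch of what such a schedule might look like: early in training the student is exposed to a large share of the teacher's visual latents, and the gate closes over time so the student must reconstruct the visual thoughts itself rather than copy them. The linear schedule and the mixing rule are assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a curriculum "sensory gating" schedule (not the paper's code).
import torch

def sensory_gate(step: int, total_steps: int, floor: float = 0.0) -> float:
    """Fraction of the teacher's visual latents exposed to the student at this training step."""
    return max(floor, 1.0 - step / total_steps)

def gated_visual_input(student_latents: torch.Tensor,
                       teacher_latents: torch.Tensor,
                       step: int, total_steps: int) -> torch.Tensor:
    """Mix teacher and student visual latents according to the current gate value."""
    g = sensory_gate(step, total_steps)
    return g * teacher_latents + (1.0 - g) * student_latents

# Example: at step 0 the student mostly sees the teacher's latents; near the end it relies on its own.
x_student, x_teacher = torch.randn(2, 4, 256), torch.randn(2, 4, 256)
early = gated_visual_input(x_student, x_teacher, step=0, total_steps=10_000)
late = gated_visual_input(x_student, x_teacher, step=9_999, total_steps=10_000)
```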

Source: arXiv 2601.10129