arXiv submission date: 2026-01-11
📄 Abstract - Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
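
The abstract describes Dynamic Windowed Alignment Learning (DWAL) only at a high level. To make the idea of "aligning the latent state with a dynamic validity window of future semantics" more concrete, here is a minimal, hypothetical PyTorch sketch of one plausible windowed-alignment loss. Everything in it is an assumption for illustration: the function name `windowed_alignment_loss`, the fixed `window_size`, the cosine-similarity metric, and the logsumexp smooth-max aggregation are not taken from the paper, which may define the window and objective quite differently.

```python
# Hypothetical sketch of a windowed-alignment loss in the spirit of DWAL.
# Names and design choices (window_size, tau, cosine similarity, logsumexp)
# are assumptions for illustration, not the Laser paper's implementation.
import torch
import torch.nn.functional as F


def windowed_alignment_loss(latent: torch.Tensor,
                            future_targets: torch.Tensor,
                            window_size: int = 4,
                            tau: float = 0.1) -> torch.Tensor:
    """latent:         (B, T, D) latent reasoning states
    future_targets: (B, T, D) embeddings of ground-truth future semantics

    For each step t, the latent state is rewarded for being close to *any*
    target inside the window [t, t + window_size), rather than forced into a
    point-wise match with target t alone.
    """
    B, T, D = latent.shape
    losses = []
    for t in range(T):
        window = future_targets[:, t:t + window_size]                    # (B, W, D)
        sim = F.cosine_similarity(latent[:, t:t + 1], window, dim=-1)    # (B, W)
        # Smooth maximum (logsumexp) over the window: the latent only needs
        # to match its closest admissible future, so early states can remain
        # a "superposition" of several plausible continuations.
        soft_best = tau * torch.logsumexp(sim / tau, dim=-1)             # (B,)
        losses.append(1.0 - soft_best)
    return torch.stack(losses, dim=-1).mean()


if __name__ == "__main__":
    B, T, D = 2, 8, 16
    latent = torch.randn(B, T, D, requires_grad=True)
    targets = torch.randn(B, T, D)
    loss = windowed_alignment_loss(latent, targets)
    loss.backward()
    print(float(loss))
```

As the window shrinks (or `tau` decreases), this objective collapses toward ordinary point-wise next-step alignment, which is one way to read the paper's claim that a dynamic window avoids premature semantic collapse.
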

Top-level tags: multi-modal, model training, natural language processing
Detailed tags: visual reasoning, latent reasoning, efficient inference, vision-language models, dynamic alignment

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning


1️⃣ One-sentence summary

This paper proposes a new method called Laser, which has the model first form a holistic internal understanding of the image (the "forest") and only then progressively focus on local details (the "trees"), greatly improving the efficiency and generalization of visual reasoning while maintaining high accuracy.

Source: arXiv 2601.06803