GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

📄 Abstract - GeoAlign: Geometric Feature Realignment for MLLM Spatial Reasoning

Multimodal large language models (MLLMs) perform remarkably well on a variety of visual tasks, yet still struggle with spatial reasoning. Recent efforts mitigate this by injecting geometric features from 3D foundation models, but they rely on static, single-layer extraction. We identify that such an approach induces a task misalignment bias: the geometric features naturally evolve towards 3D pretraining objectives, which may contradict the heterogeneous spatial demands of MLLMs, rendering any single layer fundamentally insufficient. To resolve this, we propose GeoAlign, a novel framework that dynamically aggregates multi-layer geometric features to realign them with the MLLM's actual spatial demands. GeoAlign constructs a hierarchical geometric feature bank and uses the MLLM's original visual tokens as content-aware queries to perform layer-wise sparse routing, adaptively selecting suitable geometric features for each patch. Extensive experiments on VSI-Bench, ScanQA, and SQA3D demonstrate that our compact 4B model achieves state-of-the-art performance, even outperforming larger existing MLLMs.
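
The layer-wise sparse routing described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring function, the selection rule, or how the selected features are combined, so everything below is an assumption made for illustration: scaled dot-product scores between each visual token and each layer's feature, top-k layer selection, and a softmax over the selected layers. The names `route_patch`, `geo_route`, and the parameter `k` are hypothetical and do not come from the paper.

```python
import math


def route_patch(query, layer_feats, k=2):
    """Fuse the k best-matching layers' features for one patch.

    query:       visual token for the patch (list of floats).
    layer_feats: one geometric feature vector per layer, assumed to be
                 already projected to the same dimension as the query.
    Returns (fused_vector, weights); weights has one entry per layer and
    is zero for layers that were not selected.
    """
    dim = len(query)
    # Assumption: scaled dot-product similarity as the routing score.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in layer_feats
    ]
    # Sparse routing: keep only the k highest-scoring layers.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected layers.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    weights = [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]
    # Weighted sum of the layer features, one output value per dimension.
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, layer_feats))
        for d in range(dim)
    ]
    return fused, weights


def geo_route(queries, feature_bank, k=2):
    """Route every patch independently.

    queries:      one visual token per patch.
    feature_bank: feature_bank[l][p] is layer l's feature for patch p.
    """
    fused_patches = []
    for p, query in enumerate(queries):
        layer_feats = [layer[p] for layer in feature_bank]
        fused, _ = route_patch(query, layer_feats, k)
        fused_patches.append(fused)
    return fused_patches
```

Because the weights are computed per patch, two patches in the same image can draw on different layers of the feature bank, which is the property the abstract contrasts with static single-layer extraction. A real implementation would presumably use learned projections and batched tensor operations rather than Python lists.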


1️⃣ One-Sentence Summary

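This paper proposes GeoAlign, a new framework that dynamically aggregates multi-layer geometric features from 3D models and aligns them with the visual content, addressing the weakness of existing multimodal large models on spatial reasoning tasks and allowing a small model to reach top-tier performance.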
