arXiv submission date: 2026-03-05
📄 Abstract - Mario: Multimodal Graph Reasoning with Large Language Models

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning over multimodal graphs (MMGs), where each node carries textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address these, we propose Mario, a unified framework that resolves both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages. First, a graph-conditioned VLM jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Second, a modality-adaptive graph instruction tuning mechanism organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot settings for node classification and link prediction. The code will be made available at this https URL.
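The abstract does not spell out the contrastive objective, but a common way to realize "cross-modal contrastive learning guided by graph topology" is an InfoNCE-style loss whose positive pairs are edge-connected nodes across modalities. The sketch below is a minimal, hypothetical illustration of that idea (the function name, shapes, and choice of positives are assumptions, not the paper's actual formulation):

```python
import numpy as np

def graph_contrastive_loss(text_emb, img_emb, edges, temperature=0.1):
    """InfoNCE-style loss: for each edge (u, v), pull node u's text
    embedding toward node v's image embedding, treating all other
    image embeddings as negatives. A simplified stand-in for
    topology-guided cross-modal alignment.

    text_emb, img_emb: (N, D) arrays of per-node features.
    edges: iterable of (u, v) node-index pairs.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    sim = t @ v.T / temperature  # (N, N) similarity matrix
    # row-wise log-softmax over all candidate image embeddings
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # average negative log-likelihood over graph edges only
    return -np.mean([log_prob[u, w] for u, w in edges])
```

In this toy version the graph enters only through which (text, image) pairs count as positives; the paper's "fine-grained" variant presumably conditions the features themselves on neighborhood structure as well.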

Top-level tags: llm multi-modal model training
Detailed tags: multimodal graph reasoning graph-conditioned vlm modality-adaptive instruction tuning cross-modal contrastive learning node classification

Mario: Multimodal Graph Reasoning with Large Language Models


1️⃣ One-sentence summary

This paper proposes a new framework called Mario that enables large language models to better understand and reason over multimodal graph data containing images, text, and the complex relations between them, outperforming existing methods across multiple tasks.

Source: arXiv:2603.05181