The Indra Representation Hypothesis for Multimodal Alignment
1️⃣ One-Sentence Summary
This paper proposes a new theory called the "Indra representation," arguing that foundation models across different modalities in fact learn similar internal relational structures. It formalizes this idea mathematically, enabling effective, training-free improvements in robustness and alignment for cross-model and cross-modal tasks.
Recent studies have uncovered an interesting phenomenon: unimodal foundation models tend to learn convergent representations, regardless of differences in architecture, training objectives, or data modalities. However, these representations are essentially internal abstractions that characterize each sample independently, leading to limited expressiveness. In this paper, we propose The Indra Representation Hypothesis, inspired by the philosophical metaphor of Indra's Net. We argue that representations from unimodal foundation models are converging to implicitly reflect a shared relational structure underlying reality, akin to the relational ontology of Indra's Net. We formalize this hypothesis using the V-enriched Yoneda embedding from category theory, defining the Indra representation as a relational profile of each sample with respect to others. This formulation is shown to be unique, complete, and structure-preserving under a given cost function. We instantiate the Indra representation using angular distance and evaluate it in cross-model and cross-modal scenarios involving vision, language, and audio. Extensive experiments demonstrate that Indra representations consistently enhance robustness and alignment across architectures and modalities, providing a theoretically grounded and practical framework for training-free alignment of unimodal foundation models. Our code is available at this https URL.
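To make the core idea concrete, here is a minimal sketch (not the authors' implementation) of a relational profile built from angular distances, as the abstract describes: each sample is re-represented not by its raw embedding but by its angular distances to a set of anchor samples. The function names `angular_distance` and `indra_profile` and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

def angular_distance(u, v):
    """Angular distance: arccos of cosine similarity, normalized to [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi

def indra_profile(x, anchors):
    """Relational profile of embedding x: its angular distance to each anchor."""
    return np.array([angular_distance(x, a) for a in anchors])

# Toy example: the profile depends only on relations to shared anchor samples,
# so profiles from different models/modalities live in a comparable space.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 8))    # 5 anchor samples, 8-dim embeddings
x = rng.normal(size=8)               # one query sample
profile = indra_profile(x, anchors)  # 5-dim relational representation
print(profile.shape)                 # → (5,)
```

Because angular distance is bounded in [0, 1] and depends only on pairwise geometry, two encoders with incompatible embedding spaces can still be compared through their profiles over the same anchor set, which is what enables the training-free alignment the abstract claims.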
Source: arXiv: 2604.04496