arXiv submission date: 2026-06-09
📄 Abstract - When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at this https URL.
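To make the contrast between two-sided and one-sided whitening concrete, here is a minimal NumPy sketch on a toy two-view problem. It is not the paper's code or its exact model. The toy generator `make_toy`, its amplitudes and noise levels, and the helper names `inv_sqrt`, `top_directions`, and `abs_corr` are all assumptions made for illustration. Alignment is approximated by linear CCA (SVD of the cross-covariance whitened on both sides), and prediction by rank-1 linear regression from X to Y (SVD of the cross-covariance whitened on the source side only).

```python
import numpy as np


def inv_sqrt(cov, eps=1e-8):
    """Symmetric inverse square root of a covariance matrix (a whitening map)."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T


def top_directions(X, Y):
    """Return the top X-space encoder direction for alignment and for prediction.

    Alignment (linear CCA): whiten both views, take the top left singular vector.
    Prediction (rank-1 regression X -> Y): whiten the source view only.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U_ca, _, _ = np.linalg.svd(Wx @ Cxy @ Wy)  # two-sided whitening
    U_cp, _, _ = np.linalg.svd(Wx @ Cxy)       # one-sided whitening
    return Wx @ U_ca[:, 0], Wx @ U_cp[:, 0]


def make_toy(n=20000, d=5, seed=0):
    """Hypothetical toy data: one shared signal z and one shared nuisance nu.

    The nuisance is strong in X and tightly correlated across views, while the
    signal has the larger amplitude in Y. All constants are illustrative.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)         # task-relevant latent
    nu = rng.standard_normal(n)        # nuisance latent shared by both views
    X = rng.standard_normal((n, d))    # unit-variance noise in view X
    Y = 0.3 * rng.standard_normal((n, d))  # low noise in view Y
    X[:, 0] += 1.0 * z
    X[:, 1] += 5.0 * nu
    Y[:, 0] += 2.0 * z
    Y[:, 1] += 1.0 * nu
    return X, Y, z, nu


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


if __name__ == "__main__":
    X, Y, z, nu = make_toy()
    ca_dir, cp_dir = top_directions(X, Y)
    print("alignment : corr with signal", abs_corr(X @ ca_dir, z),
          "| corr with nuisance", abs_corr(X @ ca_dir, nu))
    print("prediction: corr with signal", abs_corr(X @ cp_dir, z),
          "| corr with nuisance", abs_corr(X @ cp_dir, nu))
```

In this particular configuration the alignment direction should track the nuisance, because two-sided whitening discards amplitude and keeps only cross-view correlation, while the prediction direction should track the signal, because the target view is left unwhitened and the signal has the larger amplitude there. Changing the amplitudes in `make_toy` can flip either outcome, which is the kind of regime dependence the abstract describes.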

Top-level tags: machine learning, multi-modal, theory
Detailed tags: cross-modal alignment, cross-modal prediction, phase diagram, multimodal representation learning, nuisance correlation

When to Align, When to Predict: A Phase Diagram for Multimodal Learning


1️⃣ One-sentence summary

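This paper proposes a unified theoretical framework that analyzes the strengths and weaknesses of cross-modal alignment and cross-modal prediction, and builds a "phase diagram" to guide researchers in choosing the most effective learning strategy for different kinds of multimodal data; it also identifies when multimodal training should be avoided because it would degrade performance.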

Source: arXiv: 2606.11190