arXiv submission date: 2026-05-21
📄 Abstract - Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, autonomously specializing in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, over 4 million images across 10 modalities, which provides an FM-level pre-training with the broadest coverage of distinct imaging modalities to our DEX. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at this https URL.

Top-level tags: medical, multi-modal, model training
Detailed tags: modular representation, foundation model, self-supervised learning, modality imbalance, benchmark

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models


1️⃣ One-sentence summary

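This paper proposes Director-Experts (DEX), a new modular network in which different expert modules automatically learn to specialize in the features of particular medical imaging modalities, while a "director" module fuses the experts' knowledge into a shared space. This addresses the performance degradation caused by large differences between modalities in multi-modality medical image pre-training, and the approach is validated on a dataset of over 4 million images spanning 10 modalities.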
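
The abstract names two mechanisms, image-wise expert activation and a group exponential moving average for the director, but does not say how either is computed. The sketch below is only an illustration under loud assumptions: it assumes one expert is picked per image by the highest dot-product score, and that the director moves toward the plain mean of the experts' weights. The names `pick_expert`, `group_ema_update`, and `momentum` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only. The routing rule and the update rule are
# assumptions made for this example, not the method described in the paper.

def pick_expert(image_features, expert_keys):
    """Assumed image-wise routing: return the index of the expert whose
    (hypothetical) key vector has the highest dot product with the image."""
    scores = [
        sum(f * k for f, k in zip(image_features, key))
        for key in expert_keys
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def group_ema_update(director, experts, momentum=0.99):
    """Assumed group EMA: blend each director weight with the mean of the
    corresponding weights across the whole expert group.

    director: list of floats (flattened director weights)
    experts:  list of lists of floats, each the same length as director
    momentum: fraction of the old director value kept at each step
    """
    if not experts:
        raise ValueError("need at least one expert")
    n = len(experts)
    updated = []
    for i, d in enumerate(director):
        group_mean = sum(e[i] for e in experts) / n
        updated.append(momentum * d + (1.0 - momentum) * group_mean)
    return updated
```

With a momentum close to 1, the director in this sketch changes slowly and averages over all experts, which is one plausible way to read "distills multi-expert knowledge into a shared space"; the paper's actual rule may differ.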

Source: arXiv: 2605.21861