Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

📄 Abstract

We introduce CoMET (Composing Modality Encoders with Tabular foundation models), a simple yet highly competitive method for multimodal classification: pass each modality through a frozen pre-trained backbone, compress the resulting embeddings with PCA, and feed the concatenated features into a Tabular Foundation Model (TFM) for prediction. We show that PCA alone suffices as an adaptor, yielding strong, robust performance across modalities. When the CLS tokens of the foundation model align poorly with downstream tasks, we propose PALPooling, a lightweight adaptive token pooler that consistently improves representation quality. By composing strong frozen representation-learning backbones with TFMs, our approach achieves state-of-the-art results across diverse multimodal benchmarks without any training. On hierarchical tasks with large fine-grained class spaces, it enables fast and scalable classification, handling datasets with over 500,000 samples and 2,000 classes without any fine-tuning. Overall, our results show that composing foundation models is a simple yet powerful out-of-the-box solution for multimodal learning, challenging the necessity of complex end-to-end training pipelines for new problems.
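
The pipeline described in the abstract (frozen embeddings, per-modality PCA, concatenation, tabular classifier) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the embeddings are synthetic stand-ins for frozen-backbone outputs, `n_components=8` is an arbitrary choice, all function and class names are hypothetical, and `NearestCentroid` is only a placeholder occupying the slot where a tabular foundation model with a `fit`/`predict` interface would go. PALPooling is not shown.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA via SVD; return the mean and the top principal directions."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, components):
    """Project centered data onto the fitted principal directions."""
    return (X - mean) @ components.T

def compose_features(train_embs, test_embs, n_components):
    """Compress each modality with its own PCA, then concatenate the results."""
    train_parts, test_parts = [], []
    for X_tr, X_te in zip(train_embs, test_embs):
        mean, comps = pca_fit(X_tr, n_components)  # fit on training data only
        train_parts.append(pca_transform(X_tr, mean, comps))
        test_parts.append(pca_transform(X_te, mean, comps))
    return np.hstack(train_parts), np.hstack(test_parts)

class NearestCentroid:
    """Placeholder classifier standing in for the TFM (it is NOT a TFM)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every sample to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None]) ** 2).sum(axis=-1)
        return self.classes_[d.argmin(axis=1)]

# Synthetic stand-ins for embeddings from two frozen backbones (assumption).
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, 200)
y_test = rng.integers(0, 2, 50)

def fake_embeddings(y, dim):
    """Gaussian noise plus a class-dependent shift, mimicking separable features."""
    return rng.normal(size=(len(y), dim)) + 3.0 * y[:, None]

img_train, img_test = fake_embeddings(y_train, 64), fake_embeddings(y_test, 64)
txt_train, txt_test = fake_embeddings(y_train, 32), fake_embeddings(y_test, 32)

Z_train, Z_test = compose_features(
    [img_train, txt_train], [img_test, txt_test], n_components=8
)
model = NearestCentroid().fit(Z_train, y_train)
accuracy = (model.predict(Z_test) == y_test).mean()
print(Z_train.shape, accuracy)
```

Fitting PCA on the training split only and reusing it for the test split keeps the compression step free of test-set leakage. No backbone weights are updated anywhere in this sketch.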


1️⃣ One-Sentence Summary

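This paper proposes CoMET, a simple method that passes each modality through a frozen pre-trained backbone, compresses the resulting features with principal component analysis (PCA), concatenates them, and feeds them into a tabular foundation model for classification. Without any fine-tuning, it achieves state-of-the-art results on multiple multimodal benchmarks and can even handle very large-scale classification tasks with over 500,000 samples and 2,000 classes.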
