arXiv submission date: 2026-05-26
📄 Abstract - Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

Multimodal instruction tuning is the de facto recipe for adapting vision language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit VLMs at both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
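The pipeline described in the abstract (cluster, derive pseudo labels, train a small selector briefly, keep the least-confident samples) can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for frozen CLIP embeddings, and the choice of k-means for clustering, softmax regression for the selector, and maximum softmax probability as the confidence score are all assumptions, since the abstract does not specify them. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed clustering step); returns a cluster id per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels

def train_selector(x, pseudo_labels, k, epochs=3, lr=0.5):
    """Assumed selector: softmax regression fit to pseudo labels for a few epochs."""
    n, dim = x.shape
    w = np.zeros((dim, k))
    b = np.zeros(k)
    onehot = np.eye(k)[pseudo_labels]
    for _ in range(epochs):  # one full-batch gradient step per epoch
        grad = (softmax(x @ w + b) - onehot) / n
        w -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return w, b

def select_least_confident(x, w, b, ratio=0.15):
    """Return indices of the `ratio` fraction with the lowest selector confidence."""
    conf = softmax(x @ w + b).max(axis=1)
    budget = int(round(ratio * len(x)))
    return np.argsort(conf)[:budget], conf

# Random stand-ins for embeddings of a training pool and an unseen pool.
rng = np.random.default_rng(0)
train_pool = rng.normal(size=(400, 16))
unseen_pool = rng.normal(size=(200, 16))

k = 8
pseudo = kmeans(train_pool, k)                 # pseudo labels from cluster structure
w, b = train_selector(train_pool, pseudo, k)   # train once
picked_train, _ = select_least_confident(train_pool, w, b)
# Reuse the frozen selector on a pool it never saw, with no retraining.
picked_unseen, conf_unseen = select_least_confident(unseen_pool, w, b)
```

In this sketch the selector depends only on the embeddings, not on the model that will later be fine-tuned, which is what lets the same selected subset be reused across target models.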

Top-level tags: machine learning, multi-modal, model training
Detailed tags: data selection, multimodal instruction tuning, vision language models, transferable selector, training efficiency

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning


1️⃣ One-sentence summary

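The paper proposes OFA, a data selection framework for multimodal instruction data: a lightweight selector is trained only once and then applied to different datasets and different vision language models without recomputation, and with only 15% of the data the resulting models come close to, and in some cases exceed, the performance of full-data training.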

Source: arXiv: 2605.26761