大型视觉语言模型的并行上下文学习 / Parallel In-context Learning for Large Vision Language Models
1️⃣ One-sentence summary
This paper proposes a new method called "Parallel In-context Learning": by splitting a long demonstration context into multiple short chunks, processing them in parallel, and then intelligently integrating the results, it lets large vision-language models substantially speed up inference during task adaptation while maintaining high accuracy.
Large vision-language models (LVLMs) employ multi-modal in-context learning (MM-ICL) to adapt to new tasks by leveraging demonstration examples. While increasing the number of demonstrations boosts performance, it also incurs significant inference latency due to the quadratic computational cost of Transformer attention with respect to context length. To address this trade-off, we propose Parallel In-Context Learning (Parallel-ICL), a plug-and-play inference algorithm. Parallel-ICL partitions the long demonstration context into multiple shorter, manageable chunks. It processes these chunks in parallel and integrates their predictions at the logit level, using a weighted Product-of-Experts (PoE) ensemble to approximate the full-context output. Guided by ensemble learning theory, we introduce principled strategies for Parallel-ICL: (i) clustering-based context chunking to maximize inter-chunk diversity and (ii) similarity-based context compilation to weight predictions by query relevance. Extensive experiments on VQA, image captioning, and classification benchmarks demonstrate that Parallel-ICL achieves performance comparable to full-context MM-ICL while significantly improving inference speed. Our work offers an effective solution to the accuracy-efficiency trade-off in MM-ICL, enabling dynamic task adaptation with substantially reduced inference overhead.
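The logit-level integration step can be illustrated with a minimal sketch. A weighted Product-of-Experts takes the per-chunk next-token distributions p_i(y) and combines them as p(y) ∝ ∏_i p_i(y)^{w_i}, which in log space is a weighted sum of per-chunk log-probabilities. The function names and the toy logits below are hypothetical, not from the paper; the per-chunk weights stand in for the similarity-based query-relevance weights the abstract describes.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def poe_combine(chunk_logits, weights):
    """Weighted Product-of-Experts at the logit level (sketch).

    chunk_logits: one logit vector per context chunk, all over the
                  same vocabulary.
    weights:      per-chunk relevance weights (e.g. query similarity).
    Returns combined log-probabilities over the vocabulary.
    """
    total = sum(weights)
    weights = [w / total for w in weights]  # normalize weights
    log_probs = [log_softmax(l) for l in chunk_logits]
    vocab = len(chunk_logits[0])
    # PoE in log space: weighted sum of per-chunk log-probs per token
    combined = [sum(w * lp[t] for w, lp in zip(weights, log_probs))
                for t in range(vocab)]
    return log_softmax(combined)  # renormalize to a distribution

# Hypothetical example: 3 chunks, vocabulary of 5 tokens.
chunk_logits = [
    [2.0, 1.0, 0.0, -1.0, -2.0],
    [1.5, 1.2, 0.3, 0.0, -1.0],
    [0.5, 2.0, 0.1, -0.5, -1.5],
]
weights = [0.5, 0.3, 0.2]  # assumed similarity-based chunk weights
log_p = poe_combine(chunk_logits, weights)
prediction = max(range(len(log_p)), key=lambda t: log_p[t])
```

Because each chunk is much shorter than the full context, each forward pass is cheap and the chunks can be scored concurrently; only this lightweight combination runs sequentially.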
Source: arXiv: 2603.16092