arXiv submission date: 2026-04-06
📄 Abstract - Rethinking Model Efficiency: Multi-Agent Inference with Large Models

Most vision-language models (VLMs) use a large language model (LLM) as the decoder, generating response tokens sequentially through autoregression. The number of output tokens can therefore become the bottleneck of end-to-end latency. However, different models may require vastly different numbers of output tokens to achieve comparable performance. In this work, we conduct a comprehensive analysis of the latency of the different components of VLMs on simulated data. The experiments show that a large model with few output tokens can be more efficient than a small model with a long output sequence. An empirical study on diverse real-world benchmarks confirms this observation: a large model can achieve performance better than or comparable to that of a small model while emitting significantly fewer output tokens. To leverage the efficiency of large models, we propose a multi-agent inference framework that keeps the large model's responses short but transfers key reasoning tokens from the small model when necessary. Comparisons on benchmark tasks demonstrate that reusing the reasoning tokens of small models helps approach the performance of a large model performing its own reasoning, confirming the effectiveness of our proposal.
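The abstract's core latency argument can be illustrated with a minimal sketch. This is not the paper's model or its measured numbers; the prefill and per-token times below are made-up placeholders that merely show how a slower-per-token large model can still finish first when it emits far fewer output tokens under simple autoregressive decoding:

```python
def e2e_latency(prefill_s: float, per_token_s: float, n_tokens: int) -> float:
    """Toy autoregressive-decoding latency model:
    end-to-end latency ~= prefill time + (output tokens x per-token decode time)."""
    return prefill_s + per_token_s * n_tokens

# Hypothetical numbers for illustration only: the large model is slower per
# token but needs a much shorter output sequence than the small model.
large = e2e_latency(prefill_s=0.30, per_token_s=0.040, n_tokens=50)   # 2.3 s
small = e2e_latency(prefill_s=0.10, per_token_s=0.010, n_tokens=400)  # 4.1 s
print(large < small)  # the large model wins under these assumed numbers
```

Because decoding is sequential, the output-token count multiplies the per-token cost, which is why it dominates end-to-end latency once the output sequence is long.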

Top-level tags: multi-modal model evaluation systems
Detailed tags: vision-language models, inference latency, autoregressive decoding, multi-agent inference, model efficiency

Rethinking Model Efficiency: Multi-Agent Inference with Large Models


1️⃣ One-sentence summary

This paper finds that in vision-language models, a large model producing a short output can be more efficient than a small model producing a verbose one, and it proposes a multi-agent inference framework in which the small model supplies key reasoning information to the large model when needed, improving performance while preserving efficiency.
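The mechanism in the summary can be sketched as follows. This is a hypothetical illustration, not the paper's actual API or escalation criterion: the function names, the toy stand-in models, and the "UNSURE" trigger are all assumptions. The idea shown is only that the large model answers briefly by default and consumes the small model's long reasoning trace when its direct answer is not confident:

```python
def multi_agent_answer(question, small_model, large_model):
    """Large model answers briefly; on escalation, the small model's
    reasoning tokens are transferred into the large model's prompt."""
    answer = large_model(question)            # short direct response
    if answer != "UNSURE":                    # assumed escalation trigger
        return answer
    reasoning = small_model(question)         # cheap, long reasoning trace
    return large_model(f"{question}\nReasoning: {reasoning}\nAnswer:")

# Toy stand-in models for illustration only.
def toy_large(prompt):
    return "42" if "Reasoning:" in prompt else "UNSURE"

def toy_small(prompt):
    return "step 1: 6 * 7 ... step 2: 42"

print(multi_agent_answer("What is 6 * 7?", toy_small, toy_large))  # 42
```

The design point is that the large model never generates the long reasoning sequence itself; it only decodes a short answer, which keeps its output-token count (and hence its latency) low.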

Source: arXiv:2604.04929