arXiv submission date: 2026-01-11
📄 Abstract - Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?

Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by reasoning length, with longer textual reasoning consistently lowering classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and of the proposed ReFine-RFT, which achieves state-of-the-art performance across FGVC benchmarks. Code and models are available at the project link (this https URL).
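The abstract names \alg only as a plug-and-play normalization for multi-reward optimization, without giving details. The sketch below illustrates the general idea under stated assumptions: per-batch z-score normalization of each reward stream, combined with a hypothetical over-budget length penalty. All function names, weights, and the 128-token budget are illustrative, not taken from the paper.

```python
import numpy as np

def normalize_rewards(reward_streams, eps=1e-8):
    """Z-score each reward stream within the batch so heterogeneous
    signals (e.g., accuracy vs. length) contribute on a comparable scale."""
    normalized = {}
    for name, values in reward_streams.items():
        v = np.asarray(values, dtype=np.float64)
        normalized[name] = (v - v.mean()) / (v.std() + eps)
    return normalized

def combined_reward(accuracy_r, length_tokens, max_tokens=128, weights=(1.0, 0.5)):
    """Illustrative ensemble reward: accuracy signal plus a penalty that
    grows with reasoning length beyond a budget (the 'Cost of Thinking')."""
    length_r = [-max(0, n - max_tokens) / max_tokens for n in length_tokens]
    streams = normalize_rewards({"acc": accuracy_r, "len": length_r})
    w_acc, w_len = weights
    return w_acc * streams["acc"] + w_len * streams["len"]

# Toy batch: correct + concise, correct + verbose, wrong + medium length.
print(combined_reward(accuracy_r=[1.0, 1.0, 0.0], length_tokens=[40, 300, 90]))
```

Normalizing each stream before the weighted sum is what makes such a scheme "plug-and-play": new reward signals can be added without hand-retuning the weights to compensate for their raw scale.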

Top-level tags: multi-modal, model evaluation, natural language processing
Detailed tags: fine-grained visual classification, chain-of-thought, reasoning length, multi-reward optimization, visual-language models

Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?


1️⃣ One-Sentence Summary

This paper finds that when multi-modal large language models perform fine-grained image classification, longer textual reasoning (i.e., "thinking more") actually lowers classification accuracy. The authors term this phenomenon the "Cost of Thinking" and propose a new training framework that constrains reasoning length to improve model performance.
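The "Cost of Thinking" finding can be probed with a simple zero-shot analysis: bucket the model's answers by the length of their reasoning traces and compare per-bucket accuracy. A minimal sketch follows; the record fields and bucket edges are illustrative assumptions, not the paper's protocol.

```python
from collections import defaultdict

def accuracy_by_reasoning_length(records, edges=(0, 32, 128, 512)):
    """records: iterable of dicts with 'reasoning_tokens' (int) and
    'correct' (bool). Returns accuracy per reasoning-length bucket."""
    buckets = defaultdict(list)
    for r in records:
        n = r["reasoning_tokens"]
        lo = max(e for e in edges if e <= n)  # last edge not exceeding n
        buckets[lo].append(r["correct"])
    return {lo: sum(v) / len(v) for lo, v in sorted(buckets.items())}

# Toy data shaped like the paper's claim: accuracy drops as reasoning grows.
records = [
    {"reasoning_tokens": 10, "correct": True},
    {"reasoning_tokens": 50, "correct": True},
    {"reasoning_tokens": 200, "correct": False},
    {"reasoning_tokens": 600, "correct": False},
]
print(accuracy_by_reasoning_length(records))
```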

Source: arXiv:2601.06993