菜单

关于 🐙 GitHub
arXiv 提交日期: 2026-04-09
📄 Abstract - Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall prey to blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This pathological behavior precipitates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet, this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, rendering it impotent against tool overuse. To transcend this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum-compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously elevating reasoning accuracy.

顶级标签: agents model training multi-modal
详细标签: tool usage meta-cognition reinforcement learning efficiency conditional optimization 或 搜索:

明智行动:在具身多模态模型中培养元认知工具使用能力 / Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models


1️⃣ 一句话总结

这篇论文提出了一个名为HDPO的新框架,旨在解决当前多模态AI代理在决定何时使用外部工具时存在的‘元认知缺陷’问题,该框架通过将任务准确性和工具使用效率分开优化,成功训练出既能大幅减少不必要工具调用、又能提升推理准确性的智能模型Metis。

源自 arXiv: 2604.08545