Latest AI Papers on arXiv: Quick Overview and Study

Tag: #tool-augmented reasoning

arXiv ID: 2512.10791

arXiv submission date: 2025-12-11

llm benchmark model evaluation factuality evaluation multimodal assessment knowledge recall tool-augmented reasoning automated scoring

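The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

1️⃣ One-sentence summary

This paper introduces the FACTS Leaderboard, a comprehensive online evaluation platform that combines four independent sub-benchmarks to measure, in a multidimensional and standardized way, how well large language models generate factually accurate text across a range of scenarios.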

arXiv ID: 2511.21689

arXiv submission date: 2025-11-26

llm agents model training tool orchestration reinforcement learning efficient inference tool-augmented reasoning model coordination

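ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

1️⃣ One-sentence summary

This paper proposes ToolOrchestra, a method that trains a small "orchestrator" model to coordinate calls to a variety of intelligent tools. On complex tasks it achieves higher performance and efficiency than large language models such as GPT-5 at lower cost, and it better satisfies user preferences.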

arXiv ID: 2511.11793

arXiv submission date: 2025-11-14

agents llm model training research agents tool-augmented reasoning interaction scaling reinforcement learning benchmark evaluation

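MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

1️⃣ One-sentence summary

This paper presents MiroThinker, an open-source research agent that improves performance by increasing the depth and frequency of the model's interactions with its environment. It performs strongly on multiple benchmarks, approaching the level of commercial systems, and shows that interaction scaling is a third key dimension for improving agent capability, as important as model size and context length.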

arXiv ID: 2511.10899

arXiv submission date: 2025-11-14

llm model evaluation agents tool-augmented reasoning reasoning hallucinations code interpreter mathematical reasoning preference optimization

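From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

1️⃣ One-sentence summary

This study finds that although external tools such as a code interpreter raise a language model's answer accuracy, they also lead the model to over-rely on tool outputs and neglect the logical reasoning process, producing solutions that look correct but lack sound justification. The researchers mitigate the problem with an optimization method.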
