arXiv submission date: 2026-03-02
📄 Abstract - ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context

Next-generation AI must manage vast personal data, diverse tools, and multi-step reasoning, yet most benchmarks remain context-free and single-turn. We present ASTRA-bench (Assistant Skills in Tool-use, Reasoning & Action-planning), a benchmark that uniquely unifies time-evolving personal context with an interactive toolbox and complex user intents. Our event-driven pipeline generates 2,413 scenarios across four protagonists, grounded in longitudinal life events and annotated by referential, functional, and informational complexity. Evaluation of state-of-the-art models (e.g., Claude-4.5-Opus, DeepSeek-V3.2) reveals significant performance degradation under high-complexity conditions, with argument generation emerging as the primary bottleneck. These findings expose critical limitations in current agents' ability to ground reasoning within messy personal context and orchestrate reliable multi-step plans. We release ASTRA-bench with a full execution environment and evaluation scripts to provide a diagnostic testbed for developing truly context-aware AI assistants.
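The paper does not publish a scenario schema in this abstract, but a minimal sketch of what a scenario record with the three complexity annotations might look like can make the setup concrete. All field names, the 1-3 complexity scale, and the aggregation rule below are illustrative assumptions, not the benchmark's actual format:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an ASTRA-bench-style scenario record.
# Field names and the 1-3 complexity scale are assumptions, not the paper's schema.
@dataclass
class Scenario:
    protagonist: str                    # one of the four personas
    user_intent: str                    # the multi-step task the assistant must solve
    life_events: list[str] = field(default_factory=list)  # longitudinal personal context
    referential_complexity: int = 1    # how indirectly the intent refers to past context
    functional_complexity: int = 1     # how many tool calls / branches are required
    informational_complexity: int = 1  # how much context must be aggregated

    def overall_complexity(self) -> int:
        # Simple aggregate for bucketing scenarios by difficulty (illustrative only).
        return max(self.referential_complexity,
                   self.functional_complexity,
                   self.informational_complexity)

# Usage: a scenario whose intent refers obliquely to earlier life events.
s = Scenario(
    protagonist="Alice",
    user_intent="Move the flight I booked last month to after my rescheduled appointment",
    life_events=["2026-02-01: booked flight NYC->SFO",
                 "2026-02-20: dentist appointment moved to 2026-03-10"],
    referential_complexity=3,
    functional_complexity=2,
    informational_complexity=2,
)
print(s.overall_complexity())  # 3
```

Bucketing by the maximum axis mirrors the abstract's observation that performance degrades once any complexity dimension is high, though the paper may well aggregate differently.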

Top-level tags: agents benchmark llm
Detailed tags: tool-use agents personal context multi-step reasoning evaluation benchmark action planning

ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context


1️⃣ One-sentence summary

This paper introduces ASTRA-bench, a new benchmark that combines dynamically evolving personal life context with complex tasks to evaluate AI assistants' ability to use tools, reason, and plan multi-step actions. The evaluation finds that even today's strongest models degrade significantly on high-complexity personal-context tasks, exposing key limitations in realistic settings.

Source: arXiv 2603.01357