SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
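1️⃣ One-sentence summary
This paper presents SkillJuror, an evaluation framework that compares two ways of organizing Skills (Progressive Disclosure and a flat baseline) and finds that a Skill's organizational structure, not just its content, significantly affects how LLM agents look up and apply knowledge at runtime and thereby changes task-execution behavior, although the final performance gain depends on the characteristics of the task.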
Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.
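The figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: the variable and function names are hypothetical, the input numbers are copied from the abstract, and the trials-per-task value assumes the 410 matched trials are spread evenly over the 82 tasks, which the abstract does not state explicitly.

```python
# Numbers copied from the abstract; all names here are illustrative only.
resources_flat, resources_pd = 1.18, 3.85   # distinct Skill resources touched per trajectory
uptake_flat, uptake_pd = 1.33, 3.92         # effective uptake events per trajectory
extra_passes, matched_trials, tasks = 17, 410, 82


def fold_change(before: float, after: float) -> float:
    """Ratio of the Progressive Disclosure value to the flat-baseline value."""
    return after / before


# Gain in verifier-passing trials, expressed in percentage points of all matched trials.
pass_gain_pp = 100 * extra_passes / matched_trials

# Assumption: trials are distributed evenly across tasks.
trials_per_task = matched_trials / tasks

resource_ratio = fold_change(resources_flat, resources_pd)
uptake_ratio = fold_change(uptake_flat, uptake_pd)

print(f"pass gain: {pass_gain_pp:.1f} points over {matched_trials} matched trials")
print(f"resources touched: {resource_ratio:.2f}x the flat baseline")
print(f"uptake events: {uptake_ratio:.2f}x the flat baseline")
print(f"trials per task (if evenly split): {trials_per_task:.0f}")
```

Read this way, the "+4.1%" is 17 of 410 matched trials, while the behavioral measures roughly triple, which matches the abstract's point that runtime behavior shifts more than aggregate outcomes.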
Source: arXiv:2606.11543