SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

📄 Abstract - SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

Spreadsheets are widely used for business analysis, financial modeling, reporting, and decision-making. However, most existing spreadsheet benchmarks evaluate isolated operations such as single-formula generation or local cell edits, and therefore fail to capture end-to-end workflows in realistic business settings. We introduce SpreadsheetBench 2, a workflow-level benchmark for spreadsheet agents that covers three task categories: generation, debugging, and visualization. The benchmark is constructed from authentic business data, including financial reports and corporate filings, and is annotated and validated by domain experts. The benchmark contains 321 tasks; on average, each instance spans 11.8 worksheets and requires 593.5 cell modifications, reflecting large multi-sheet workbooks with cross-sheet dependencies. We evaluate eight frontier large language models under a unified multi-turn agent scaffold, and additionally include several LLM-based spreadsheet products as complementary baselines. Results show that current systems remain far from reliable on real-world workflows: the best model achieves 34.89% overall task accuracy, and debugging accuracy is as low as 12.00%. Trajectory analysis and a failure taxonomy further indicate that insufficient spreadsheet inspection and incorrect target-cell selection are the dominant bottlenecks. Together, these findings position SpreadsheetBench 2 as a challenging testbed for advancing reliable spreadsheet automation. Project page: this https URL

1️⃣ One-Sentence Summary

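The paper introduces SpreadsheetBench 2, a benchmark that evaluates how well AI agents handle complex end-to-end spreadsheet tasks in real business settings (generation, debugging, and visualization) across multi-sheet workbooks with cross-sheet dependencies. It finds that even the most advanced current models achieve low accuracy on these tasks, with the main bottlenecks being insufficient inspection of the spreadsheet and incorrect selection of target cells.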
