arXiv submission date: 2026-01-20
📄 Abstract - DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.

Top-level tags: benchmark agents model evaluation
Detailed tags: data science agents evaluation benchmark multimodal interaction llm agents real-world tasks

DSAEval: A Comprehensive Benchmark for Evaluating Data Science Agents / DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems


1️⃣ One-Sentence Summary

This paper introduces DSAEval, a comprehensive benchmark of 641 real-world data science problems spanning multiple domains and data modalities, designed to holistically assess the capabilities of LLM-based data science agents through multimodal environment perception, multi-query interactions, and multi-dimensional evaluation.


2️⃣ Key Contributions

1. Comprehensive real-world benchmark: 641 problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text)

2. Multimodal environment perception: agents interpret observations across multiple modalities, including text and vision

3. Multi-query interactions: mirror the iterative, cumulative nature of real-world data science projects

4. Multi-dimensional evaluation protocol: holistic assessment across reasoning, code, and results
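The abstract says agents are scored along three dimensions (reasoning, code, results) but does not specify how the per-dimension scores are combined into an overall score. A minimal sketch, assuming each dimension yields a score in [0, 1] and assuming an equal-weight average (the class and function names here are hypothetical, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Hypothetical per-dimension scores, each normalized to [0, 1]."""
    reasoning: float
    code: float
    result: float

def overall_score(s: DimensionScores,
                  weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Combine the three dimensions with a weighted average (assumed scheme)."""
    w_r, w_c, w_res = weights
    return w_r * s.reasoning + w_c * s.code + w_res * s.result

# Example: strong reasoning, mediocre code, weak result
print(overall_score(DimensionScores(reasoning=0.9, code=0.6, result=0.3)))  # 0.6
```

The actual benchmark may weight dimensions unequally or use per-task rubrics; this only illustrates why a multi-dimensional protocol distinguishes agents that a single pass/fail result metric would conflate.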


3️⃣ Main Results and Value

Result highlights: among the 11 agentic LLMs evaluated, Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective; multimodal perception consistently improves performance on vision-related tasks, with gains of 2.04% to 11.30%.

Practical value: current data science agents handle structured data and routine analysis workflows well, but substantial challenges remain in unstructured domains, pointing to concrete directions for future agent development.


4️⃣ Glossary

Source: arXiv:2601.13591