DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
1️⃣ One-Sentence Summary
This paper introduces DSGym, a standardized framework that provides an extensible testbed with real-data execution environments, addressing the inconsistent evaluation interfaces, narrow task coverage, and shortcut solvability of existing data science agent benchmarks, and demonstrates how the framework can be used to train a model that outperforms GPT-4o.
Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in the literature, and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
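To make the shortcut-solvability idea concrete, here is a minimal sketch of how such a filter could work: run each task once with the data withheld, and flag tasks the agent still answers correctly, since those do not actually test data grounding. This is an illustration only; `Task`, `run_agent`, and `grade` are hypothetical names, not the DSGym API.

```python
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str
    prompt: str
    data_paths: list[str]  # files the task is supposed to require


def is_shortcut_solvable(task: Task, run_agent, grade) -> bool:
    """Return True if the agent passes the task with no data access.

    `run_agent(prompt, data_paths)` and `grade(task, answer)` are
    assumed callables supplied by the evaluation harness.
    """
    answer = run_agent(task.prompt, data_paths=[])  # withhold the data
    return grade(task, answer)


def filter_tasks(tasks, run_agent, grade):
    """Split tasks into those that genuinely require the data and
    those solvable from the prompt alone (candidates for removal)."""
    kept, flagged = [], []
    for task in tasks:
        if is_shortcut_solvable(task, run_agent, grade):
            flagged.append(task)
        else:
            kept.append(task)
    return kept, flagged
```

Under this scheme, only the `kept` tasks would survive into the refined suite, which matches the paper's motivation: a benchmark item should be unsolvable without consulting the actual data.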
Source: arXiv: 2601.16344