📄
Abstract - DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.
DataSTORM:利用探索性数据分析和数据叙事对大规模数据库进行深度研究 /
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
1️⃣ 一句话总结
这篇论文提出了一个名为DataSTORM的AI智能体系统,它能够像做研究一样,自主地从大规模结构化数据库和互联网中发现问题、验证假设并生成连贯的数据分析报告,显著提升了深度数据研究的自动化水平。