arXiv submission date: 2026-05-25
📄 Abstract - Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

Top-level tags: agents, benchmark, reinforcement learning
Detailed tags: arc-agi, exploration, speed-depth trade-off, epistemic reasoning, benchmark critique

Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3


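1️⃣ One-sentence summary

This paper argues that the public ARC-AGI-3 evaluation set has a serious validity flaw, since all 25 public games can be reached with non-intelligent strategies such as a single blind step or a repeated action; against this backdrop, the authors present AERA, a three-phase (EXPLORE / VERIFY / PLAN) epistemic agent, together with a Speed--Depth trade-off framework in which, under a convexity assumption, RHAE's quadratic form arises as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain.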

Source: arXiv: 2605.25931