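InnoGym: Benchmarking the Innovation Potential of AI Agents
1️⃣ One-sentence summary
This paper introduces InnoGym, the first benchmark framework designed specifically to evaluate the innovation potential of AI agents. It uses two metrics, performance gain and methodological novelty, to measure whether an agent can not only produce correct answers but also propose original solutions, and it reveals a gap between creativity and effectiveness in current AI agents.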
LLMs and agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.