MathDuels: Evaluating LLMs as Problem Posers and Solvers
1️⃣ One-Sentence Summary
This paper introduces MathDuels, a new evaluation framework in which large language models act as both "problem posers" and "problem solvers" in an adversarial setting, revealing capability differences that traditional static tests cannot distinguish; as stronger models join, problem difficulty rises automatically, avoiding the ceiling effect of fixed benchmarks.
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
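To make the scoring step concrete, below is a minimal sketch of how a Rasch model can jointly estimate solver abilities and problem difficulties from a solver-by-problem correctness matrix. The data, the maximum-likelihood fit via `scipy.optimize.minimize`, and the mean-difficulty summary of author quality are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical response matrix: rows = solver models, columns = problems,
# entries are 1 (solved correctly) or 0 (failed). Values are illustrative only.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
])
n_solvers, n_problems = responses.shape

def neg_log_likelihood(params):
    # First n_solvers entries are solver abilities (theta),
    # the remaining entries are problem difficulties (b).
    theta = params[:n_solvers]
    b = params[n_solvers:]
    # Rasch model: P(solver i answers problem j correctly) = sigmoid(theta_i - b_j)
    logits = theta[:, None] - b[None, :]
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9  # avoid log(0)
    ll = responses * np.log(p + eps) + (1 - responses) * np.log(1 - p + eps)
    return -ll.sum()

result = minimize(neg_log_likelihood, np.zeros(n_solvers + n_problems), method="L-BFGS-B")
abilities = result.x[:n_solvers]
difficulties = result.x[n_solvers:]

# The Rasch model is only identified up to a common shift;
# center difficulties at zero to fix the scale.
shift = difficulties.mean()
abilities -= shift
difficulties -= shift

print("solver abilities:", abilities)
print("problem difficulties:", difficulties)

# Per the abstract, an author-quality score could then be summarized from the
# estimated difficulties of the problems each model authored (e.g. their mean).
```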
Source: arXiv: 2604.21916