Abstract - GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
We introduce GeoBuildBench, a benchmark designed to evaluate whether large language models and multimodal agents can ground informal natural-language plane geometry problems into executable geometric constructions. Unlike existing geometry benchmarks that focus on answer correctness or static diagram interpretation, GeoBuildBench treats geometry diagram generation as an interactive construction task: given a textual problem, an agent must generate a domain-specific language (DSL) program that produces a diagram containing the explicitly specified geometric objects and satisfying verifiable constraints. The benchmark comprises 489 Chinese textbook-style problems, curated through automated filtering and human validation to ensure text-complete, constructible problem specifications. We evaluate several state-of-the-art multimodal models in a bounded iterative setting and show that, despite reasonable success rates, models frequently exhibit structural hallucinations, omit required objects, and fail to satisfy geometric constraints, with limited ability to exploit visual and constraint-based feedback for self-correction. These results highlight geometry construction as a rigorous testbed for grounded, executable reasoning beyond textual or visual plausibility. Our benchmark and code are publicly available.
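The bounded iterative setting described above can be sketched as a generate-execute-verify loop. The sketch below is purely illustrative: the DSL, the `model`, `interpreter`, and constraint interfaces are hypothetical stand-ins, not the benchmark's actual API.

```python
# Hypothetical sketch of a bounded construct-and-verify loop.
# All class/method names (model.generate, interpreter.run, c.holds)
# are illustrative assumptions, not GeoBuildBench's real interface.

def satisfied(diagram, constraints):
    """Check every declared geometric constraint against the rendered diagram."""
    return all(c.holds(diagram) for c in constraints)

def solve(problem, model, interpreter, max_rounds=3):
    """Give the agent a bounded number of attempts, feeding back the
    failed diagram so it can try to self-correct."""
    feedback = None
    for _ in range(max_rounds):
        program = model.generate(problem, feedback)  # emit a DSL program
        diagram = interpreter.run(program)           # execute it into a diagram
        if satisfied(diagram, problem.constraints):
            return program                           # constraints verified
        feedback = diagram                           # retry with diagram feedback
    return None                                      # failed within the round budget
```

The key property this loop captures is that success is decided by executable verification of the diagram, not by the textual plausibility of the program.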
GeoBuildBench: A Benchmark for Interactive and Executable Geometry Construction from Natural Language
1️⃣ One-Sentence Summary
This paper proposes a new benchmark, GeoBuildBench, to test whether AI models can, from a natural-language description, construct a diagram satisfying the stated geometric conditions step by step via a program, as a person would. Experiments show that existing models partially succeed but often make structural errors and struggle to self-correct, underscoring that truly executable geometric reasoning remains a major challenge.