arXiv submission date: 2026-01-02
📄 Abstract - InfoSynth: Information-Guided Benchmark Synthesis for LLMs

Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation. However, efficiently creating new benchmarks to evaluate these capabilities remains a challenge. Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming. Furthermore, existing benchmarks often contaminate LLM training data, necessitating novel and diverse benchmarks to accurately assess their genuine capabilities. This work introduces InfoSynth, a novel framework for automatically generating and evaluating reasoning benchmarks guided by information-theoretic principles. We propose metrics based on KL-divergence and entropy to quantify benchmark novelty and diversity without relying on costly model evaluations. Building on this framework, we develop an end-to-end pipeline that synthesizes robust Python coding problems from seed datasets using genetic algorithms and iterative code feedback. Our method generates accurate test cases and solutions to new problems 97% of the time, and the synthesized benchmarks consistently exhibit higher novelty and diversity compared to their seed datasets. Moreover, our algorithm provides a method for controlling the novelty/diversity and difficulty of generated problems. InfoSynth offers a scalable, self-verifying pipeline for constructing high-quality, novel and diverse benchmarks for LLMs. Project Page: this https URL
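The abstract attributes the novelty and diversity scores to KL-divergence and entropy computed without model evaluations. Below is a minimal sketch of how such metrics could be computed, assuming problems are reduced to smoothed token-frequency distributions over a shared vocabulary; the paper's actual featurization is not given here, so `tokenize`, the smoothing constant, and the exact formulas are illustrative assumptions rather than InfoSynth's API.

```python
# Sketch of information-guided novelty/diversity scoring (assumptions noted above).
import math
from collections import Counter

def tokenize(problem: str) -> list[str]:
    # Hypothetical tokenizer: lowercase whitespace split.
    return problem.lower().split()

def distribution(problems: list[str], vocab: list[str], eps: float = 1e-9) -> dict[str, float]:
    # Smoothed token-frequency distribution over a shared vocabulary.
    counts = Counter(tok for p in problems for tok in tokenize(p))
    total = sum(counts[w] + eps for w in vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    # Novelty proxy: KL(synthesized || seed); larger means the synthesized
    # benchmark departs more from the seed distribution.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def entropy(p: dict[str, float]) -> float:
    # Diversity proxy: Shannon entropy of the synthesized distribution.
    return -sum(p[w] * math.log(p[w]) for w in p)

def score_benchmark(seed: list[str], synthesized: list[str]) -> tuple[float, float]:
    vocab = sorted({tok for p in seed + synthesized for tok in tokenize(p)})
    p_seed = distribution(seed, vocab)
    p_synth = distribution(synthesized, vocab)
    return kl_divergence(p_synth, p_seed), entropy(p_synth)

if __name__ == "__main__":
    seed = ["sum a list of integers", "reverse a string"]
    synth = ["merge overlapping intervals", "reverse a string in place"]
    novelty, diversity = score_benchmark(seed, synth)
    print(f"novelty (KL) = {novelty:.3f}, diversity (entropy) = {diversity:.3f}")
```

In practice such distributions would more likely be built over embeddings or other semantic features rather than raw tokens, but the readout is the same: higher KL-divergence from the seed set indicates novelty, and higher entropy indicates diversity.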

Top-level tags: llm benchmark, model evaluation
Detailed tags: benchmark synthesis, information theory, genetic algorithms, code generation, automatic evaluation

InfoSynth: Information-Guided Benchmark Synthesis for LLMs


1️⃣ One-Sentence Summary

This paper proposes InfoSynth, an automated framework that uses information-theoretic principles and genetic algorithms to efficiently generate novel and diverse programming problems for testing large language models, addressing the high cost of manual benchmark creation and the risk that models have already "seen" existing test sets.
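The abstract also describes a self-verifying loop: problems are evolved from seeds with genetic-algorithm-style mutations, and candidate solutions plus test cases are kept only when the solution actually passes its own tests. The sketch below illustrates that feedback loop under stated assumptions; `mutate` and `propose_solution_and_tests` are hypothetical stand-ins for the LLM-driven steps, not the paper's interfaces.

```python
# Sketch of a self-verifying synthesis loop (LLM steps stubbed out).
import random

def mutate(problem: str, rng: random.Random) -> str:
    # Hypothetical "genetic" mutation; in InfoSynth this would be an LLM-driven
    # edit. Here we just append a perturbing constraint so the loop runs.
    constraints = ["with O(n) time", "without extra memory", "for unicode input"]
    return f"{problem} {rng.choice(constraints)}"

def propose_solution_and_tests(problem: str) -> tuple[str, str]:
    # Stub for the LLM step that writes a reference solution and unit tests.
    solution = "def solve(x):\n    return sorted(x)\n"
    tests = "assert solve([3, 1, 2]) == [1, 2, 3]\nassert solve([]) == []\n"
    return solution, tests

def verify(solution: str, tests: str) -> bool:
    # Execute the candidate solution against its own tests; a failure would feed
    # back into another mutation round instead of entering the benchmark.
    # (A real pipeline should sandbox this execution.)
    namespace: dict = {}
    try:
        exec(solution, namespace)
        exec(tests, namespace)
        return True
    except Exception:
        return False

def synthesize(seeds: list[str], rounds: int = 3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    accepted = []
    for _ in range(rounds):
        candidate = mutate(rng.choice(seeds), rng)
        solution, tests = propose_solution_and_tests(candidate)
        if verify(solution, tests):
            accepted.append(candidate)  # real pipeline would also re-score novelty/diversity
    return accepted

if __name__ == "__main__":
    print(synthesize(["sort a list of integers"]))
```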

Source: arXiv:2601.00575