arXiv submission date: 2026-03-15
📄 Abstract - s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Neurosymbolic approaches that combine Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS to provide fast assembly routines for cryptography, and its correctness is established by formal verification. Formally verifying this library was a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition holds. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark (s2n-bignum-bench) is available at this https URL.
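The pass criterion described above — a generated proof script counts as solved only if HOL Light accepts it within a fixed proof-check timeout — can be sketched as a small harness. This is a minimal illustration, not the benchmark's actual CLI: the checker command, exit-code convention, and function names here are assumptions for the sake of the example.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ProofResult:
    accepted: bool
    reason: str


def check_proof(command, timeout_s):
    """Run a proof-checking command; accept iff it exits 0 within the timeout.

    Mirrors the benchmark's pass criterion as described in the abstract.
    The convention "exit code 0 means the proof script was accepted" is an
    illustrative assumption, not s2n-bignum-bench's real interface.
    """
    try:
        proc = subprocess.run(command, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Exceeding the fixed proof-check timeout counts as a failure.
        return ProofResult(False, "timeout")
    if proc.returncode == 0:
        return ProofResult(True, "accepted")
    return ProofResult(False, f"checker exited with {proc.returncode}")


# Stand-in commands (not a real HOL Light invocation), used only to
# exercise the harness: one "proof" that succeeds, one that fails.
ok = check_proof([sys.executable, "-c", "pass"], timeout_s=30)
bad = check_proof([sys.executable, "-c", "raise SystemExit(1)"], timeout_s=30)
print(ok.accepted, bad.accepted)
```

In a real setup, the command would instead launch HOL Light on the candidate proof script, and the timeout would be the benchmark's fixed proof-check budget.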

Top-level tags: llm, theory, benchmark
Detailed tags: theorem proving, formal verification, cryptographic assembly, proof synthesis, hol light

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs


1️⃣ One-sentence summary

This paper introduces a new benchmark, s2n-bignum-bench, built from the already-verified assembly code of an industrial cryptographic library. It tests whether large language models can, like human experts, automatically generate proofs of real low-level programs that are accepted by a formal verification tool, thereby evaluating practical reasoning ability beyond competition mathematics.

Source: arXiv 2603.14628