EXHIB: A Benchmark for Realistic and Diverse Evaluation of Function Similarity in the Wild

📄 Abstract

Binary Function Similarity Detection (BFSD) is a core problem in software security, supporting tasks such as vulnerability analysis, malware classification, and patch provenance. Over the past few decades, numerous models and tools have been developed for this task; however, because the field lacks a comprehensive, universal benchmark, researchers have struggled to compare models effectively. Existing datasets are limited in scope, often focusing on a narrow set of transformations or binary types, and fail to reflect the full diversity of real-world applications. We introduce EXHIB, a benchmark comprising five realistic datasets collected from the wild, each highlighting a distinct aspect of the BFSD problem space. We evaluate nine representative models spanning multiple BFSD paradigms on EXHIB and observe performance degradations of up to 30% on the firmware and semantic datasets compared to standard settings, revealing substantial generalization gaps. Our results show that robustness to low- and mid-level binary variations does not generalize to high-level semantic differences, underscoring a critical blind spot in current BFSD evaluation practices.

1️⃣ One-Sentence Summary

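This paper proposes EXHIB, a new benchmark that evaluates binary function similarity detection models across five real-world datasets. It finds that existing models degrade substantially when faced with the diversity of real-world software, exposing a major flaw in current evaluation practices.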
