Beyond Retrieval: A Multitask Benchmark and Model for Code Search
1️⃣ One-sentence summary
This paper introduces CoREB, a multitask benchmark together with a fine-tuned reranking model, to address the data contamination and label noise found in existing code search benchmarks. Experiments show the model is the first to achieve consistent gains across all three tasks: text-to-code, code-to-text, and code-to-code.
Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask **co**de **r**etrieval and r**e**ranking **b**enchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
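The abstract reports results in nDCG@10 over graded relevance judgments. As a reference for how that metric behaves, here is a minimal sketch of nDCG@10 (standard formula, not code from the paper); the sample relevance labels are illustrative:

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each graded label is discounted
    # by log2 of its (1-based) rank position plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalise by the DCG of the ideal (descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical graded labels (0-3) for one query's ranked results.
ranking = [3, 0, 2, 1, 0]
print(ndcg_at_k(ranking))
```

With graded labels, a model is rewarded for ranking highly relevant code above partially relevant code, which binary relevance cannot distinguish; "near-zero nDCG@10" on keyword queries means relevant items rarely appear near the top at all.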
Source: arXiv: 2605.04615