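MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks
1️⃣ One-sentence summary
This paper proposes the MTR-Suite framework, which combines an LLM-driven automatic auditing tool with a low-cost dialogue generation system to address two problems in existing conversational retrieval benchmarks: expensive human annotation and unnatural automated data. It also builds a more discriminative general-domain benchmark.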
Accurate evaluation of conversational retrieval is pivotal for advancing Retrieval-Augmented Generation (RAG) systems. However, existing conversational retrieval benchmarks suffer from costly, sparse human annotation or rigid, unnatural automated heuristics. To address these challenges, we introduce MTR-Suite, a unified framework for auditing, synthesizing, and benchmarking retrieval. It features: (1) MTR-Eval, an LLM-based auditor quantifying alignment gaps in previous benchmarks; (2) MTR-Pipeline, a multi-agent system using greedy traversal clustering to generate high-fidelity dialogues at 1/400th human cost; and (3) MTR-Bench, a rigorous general-domain benchmark. MTR-Bench mimics production-style challenges (hard topic switching, verbosity), offering superior discriminative power. We make our code and data publicly available to facilitate future research.
Source: arXiv: 2605.20729