arXiv submission date: 2026-01-13
📄 Abstract - MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, underscoring the need for principled evaluation of so-called user proxy agents. We present MirrorBench, a reproducible, extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational tasks, explicitly decoupled from downstream task success. MirrorBench features a modular execution engine with typed interfaces, metadata-driven registries, multi-backend support, caching, and robust observability. The system supports pluggable user proxies, datasets, tasks, and metrics, enabling researchers to evaluate arbitrary simulators under a uniform, variance-aware harness. We include three lexical-diversity metrics (MATTR, Yule's K, and HD-D) and three LLM-judge-based metrics (GTEval, Pairwise Indistinguishability, and Rubric-and-Reason). Across four open datasets, MirrorBench yields variance-aware results and reveals systematic gaps between user proxies and real human users. The framework is open source and includes a simple command-line interface for running experiments, managing configurations and caching, and generating reports. The framework can be accessed at this https URL.
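For concreteness, here is a minimal Python sketch of two of the lexical-diversity metrics named in the abstract, MATTR and Yule's K, using their standard textbook definitions. The function names and the default 50-token window are illustrative assumptions; this is not MirrorBench's actual implementation.

```python
# Illustrative sketch of MATTR and Yule's K (standard definitions),
# not MirrorBench's code. Names and window size are assumptions.
from collections import Counter


def mattr(tokens: list[str], window: int = 50) -> float:
    """Moving-Average Type-Token Ratio: average the type/token ratio
    over every sliding window of `window` tokens."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Fall back to the plain type-token ratio for short utterances.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of word types occurring exactly i times and N is the total
    token count. Lower K indicates higher lexical diversity."""
    n = len(tokens)
    if n == 0:
        return 0.0
    type_counts = Counter(tokens)               # type -> occurrence count
    spectrum = Counter(type_counts.values())    # i -> number of types occurring i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 1e4 * (s2 - n) / (n * n)


if __name__ == "__main__":
    utterance = "i just need a quick flight to boston nothing fancy just cheap".split()
    print(f"MATTR:    {mattr(utterance, window=5):.3f}")
    print(f"Yule's K: {yules_k(utterance):.1f}")
```

Both metrics operate on tokenized user utterances only, which matches the abstract's point that the evaluation is decoupled from downstream task success.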

Top tags: llm agents benchmark
Detailed tags: user simulation evaluation framework human-likeness conversational ai llm evaluation

MirrorBench: An Extensible Framework to Evaluate User-Proxy Agents for Human-Likeness


1️⃣ One-Sentence Summary

This paper proposes MirrorBench, an extensible evaluation framework designed specifically to measure how human-like the utterances generated by a large language model acting as a user proxy are, rather than focusing only on downstream task success. A sketch of the kind of plugin registry such extensibility implies follows below.
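To illustrate what the "pluggable metrics" and "metadata-driven registries" mentioned in the abstract could look like in practice, here is a hypothetical registry sketch. Every name in it (Metric, register_metric, build_metric, MattrMetric) is an assumption for illustration, not MirrorBench's actual API.

```python
# Hypothetical sketch of a metadata-driven, pluggable metric registry.
# All names are illustrative assumptions, not MirrorBench's real interface.
from typing import Callable, Protocol


class Metric(Protocol):
    """Anything that scores a list of user-proxy utterances."""
    def score(self, utterances: list[str]) -> float: ...


_REGISTRY: dict[str, Callable[[], Metric]] = {}


def register_metric(name: str) -> Callable:
    """Decorator that registers a metric factory under a string key,
    so experiment configs can refer to metrics by name."""
    def decorator(factory: Callable[[], Metric]) -> Callable[[], Metric]:
        _REGISTRY[name] = factory
        return factory
    return decorator


@register_metric("mattr")
class MattrMetric:
    """Moving-average type-token ratio over the concatenated utterances."""
    def score(self, utterances: list[str]) -> float:
        tokens = " ".join(utterances).split()
        window = min(50, len(tokens)) or 1
        ratios = [
            len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)
        ]
        return sum(ratios) / len(ratios) if ratios else 0.0


def build_metric(name: str) -> Metric:
    """Resolve a metric by name, as a config-driven harness might."""
    return _REGISTRY[name]()


if __name__ == "__main__":
    metric = build_metric("mattr")
    print(metric.score(["book me a cheap flight", "any airline is fine really"]))
```

The point of the pattern is that new user proxies, datasets, tasks, or metrics can be added by registering a factory under a string key, without touching the execution engine.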

Source: arXiv:2601.08118