arXiv submission date: 2026-05-18
📄 Abstract - TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2) End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows (intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation) across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at this https URL.

Top-level tags: llm, benchmark, systems
Detailed tags: telecommunication, industrial application, knowledge graph, task evaluation, agent workflow

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?


1️⃣ One-sentence summary

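This paper introduces TeleCom-Bench, a comprehensive evaluation benchmark with more than 22,000 samples that tests large language models at two levels, telecom knowledge comprehension and execution of real-world workflows (such as fault diagnosis and solution generation), and finds that current models reach about 90% accuracy on simple language tasks but drop to roughly 30% on complex execution tasks, indicating that they are good at "diagnosis" but cannot fill the role of a "field engineer".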

Source: arXiv: 2605.18025