arXiv submission date: 2026-06-02
📄 Abstract - Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.

Top-level tags: llm model evaluation benchmark
Detailed tags: repair assistance safety cross-lingual consumer electronics troubleshooting

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions


1️⃣ One-sentence summary

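This paper builds a benchmark of 991 real-world repair questions and evaluates six mainstream LLMs, including GPT-5.4, on phone repair, computer repair, and data recovery, finding that although the models can offer useful advice, they remain unreliable on high-risk tasks that call for safety judgment, such as hardware-level diagnosis and repair prioritization, and that English responses are clearly better than Bangla responses.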

Source: arXiv: 2606.03331