Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

📄 Abstract

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.
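The evaluation setup described above (responses scored on four criteria, then compared across models, domains, and languages) can be sketched as a small aggregation routine. This is a minimal illustration, not the authors' pipeline: the `ScoredResponse` record, the `mean_scores` helper, the 1-to-5 scale, the model name `model-a`, and every score value are hypothetical placeholders. Only the four criteria names, the domain names, and the two languages come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four repair-specific criteria named in the abstract.
CRITERIA = ("correctness", "completeness", "practicality", "safety")


@dataclass(frozen=True)
class ScoredResponse:
    """One model answer with per-criterion scores (hypothetical record layout)."""
    model: str
    domain: str      # e.g. "phone repair", "computer repair", "data recovery"
    language: str    # "English" or "Bangla"
    scores: dict     # criterion -> score; a 1-5 scale is assumed here


def mean_scores(responses, key):
    """Average each criterion over responses grouped by key(response)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for r in responses:
        for criterion in CRITERIA:
            grouped[key(r)][criterion].append(r.scores[criterion])
    return {
        group: {criterion: mean(values) for criterion, values in per_criterion.items()}
        for group, per_criterion in grouped.items()
    }


# Made-up placeholder scores for illustration only; these are NOT results from the paper.
responses = [
    ScoredResponse("model-a", "phone repair", "English",
                   {"correctness": 4, "completeness": 3, "practicality": 4, "safety": 5}),
    ScoredResponse("model-a", "phone repair", "Bangla",
                   {"correctness": 3, "completeness": 3, "practicality": 2, "safety": 4}),
    ScoredResponse("model-a", "data recovery", "English",
                   {"correctness": 2, "completeness": 3, "practicality": 4, "safety": 3}),
]

by_language = mean_scores(responses, key=lambda r: r.language)
by_domain = mean_scores(responses, key=lambda r: r.domain)
```

Passing a different `key` function (for example `lambda r: (r.model, r.language)`) would give the per-model, per-language breakdown that a cross-lingual comparison needs, without changing the aggregation code.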

1️⃣ One-Sentence Summary

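This paper builds a benchmark of 991 real-world repair questions and evaluates six mainstream large language models, including GPT-5.4, on phone repair, computer repair, and data recovery. It finds that although the models can offer useful advice, they remain unreliable on high-risk tasks that require safety judgment, such as hardware-level diagnosis and repair sequencing, and that English responses are clearly better than Bangla responses.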
