From Reviews to Requirements: Can LLMs Generate Human-Like User Stories?
1️⃣ One-Sentence Summary
This paper finds that large language models can effectively convert messy, unstructured app store reviews into well-formatted, actionable software requirements (user stories), with fluency that can rival human writing, but they still fall short at generating independent, non-duplicate requirements.
App store reviews provide a constant flow of real user feedback that can help improve software requirements. However, these reviews are often messy, informal, and difficult to analyze manually at scale. Although automated techniques exist, many do not perform well when replicated and often fail to produce clean, backlog-ready user stories for agile projects. In this study, we evaluate how well large language models (LLMs) such as GPT-3.5 Turbo, Gemini 2.0 Flash, and Mistral 7B Instruct can generate usable user stories directly from raw app reviews. Using the Mini-BAR dataset of 1,000+ health app reviews, we tested zero-shot, one-shot, and two-shot prompting methods. We evaluated the generated user stories using both human judgment (via the RUST framework) and a RoBERTa classifier fine-tuned on UStAI to assess their overall quality. Our results show that LLMs can match or even outperform humans in writing fluent, well-formatted user stories, especially when few-shot prompts are used. However, they still struggle to produce independent and unique user stories, which are essential for building a strong agile backlog. Overall, our findings show how LLMs can reliably turn unstructured app reviews into actionable software requirements, providing developers with clear guidance to turn user feedback into meaningful improvements.
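The zero-, one-, and two-shot prompting setups described above can be sketched as prompt construction over demonstration pairs. The example reviews, stories, and function name below are illustrative assumptions, not the authors' actual prompts or data.

```python
# Hypothetical sketch of few-shot prompt construction for user-story
# generation from app reviews. EXAMPLES and the prompt wording are
# assumptions for illustration, not taken from the paper.

EXAMPLES = [
    # (raw review, reference user story) demonstration pairs
    ("The app crashes every time I try to log my weight.",
     "As a user, I want to log my weight without the app crashing, "
     "so that I can track my health data reliably."),
    ("Please add a dark mode, the screen is too bright at night.",
     "As a night-time user, I want a dark mode, "
     "so that the screen is comfortable to read in low light."),
]

def build_prompt(review: str, shots: int = 0) -> str:
    """Build a zero- (shots=0), one- (1), or two-shot (2) prompt."""
    parts = [
        "Convert the app review into an agile user story of the form "
        "'As a <role>, I want <goal>, so that <benefit>.'"
    ]
    # Prepend the requested number of worked demonstrations.
    for raw, story in EXAMPLES[:shots]:
        parts.append(f"Review: {raw}\nUser story: {story}")
    # The target review, left open for the model to complete.
    parts.append(f"Review: {review}\nUser story:")
    return "\n\n".join(parts)
```

The resulting string would be sent as-is to an LLM (e.g. GPT-3.5 Turbo or Mistral 7B Instruct); the study compares model output quality across the three `shots` settings.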
Source: arXiv: 2603.28163