arXiv submission date: 2026-03-23
📄 Abstract - SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, their advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at this https URL.

Top-level tags: llm, model training, data
Detailed tags: knowledge injection, synthetic data generation, prompt engineering, data augmentation, baseline method

SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection


1️⃣ One-sentence summary

This paper proposes SPA, a simple method that uses a small set of carefully designed prompts to generate large-scale synthetic data, effectively injecting domain-specific knowledge into large language models. It outperforms several more complex approaches and offers a strong baseline for future research.

Source: arXiv: 2603.22213