arXiv submission date: 2026-03-26
📄 Abstract - SafeMath: Inference-time Safety improves Math Accuracy

Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further propose SafeMath -- a safety alignment technique that reduces harmful outputs while maintaining, and in some cases improving, mathematical reasoning performance. Our results highlight the importance of disentangling linguistic harm from math reasoning and demonstrate that effective safety alignment need not come at the cost of accuracy. We release the source code and dataset at this https URL.
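The core idea of disentangling linguistic harm from the underlying arithmetic can be illustrated with a toy sketch. This is not the paper's SafeMath method: the keyword lexicon, the `neutralize` rewriter, and the regex "solver" below are all hypothetical stand-ins (a real system would use a learned safety classifier and an LLM), chosen only to show an inference-time gate that sanitizes a harmful narrative while leaving the math intact.

```python
import re

# Toy stand-in for a real safety classifier (hypothetical lexicon)
HARM_LEXICON = {"steal", "hurt", "poison"}

def is_harmful(text: str) -> bool:
    """Flag a problem whose narrative contains a harmful term."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(w in HARM_LEXICON for w in words)

def neutralize(text: str) -> str:
    """Swap flagged verbs for a neutral one, keeping every number intact."""
    def repl(m: re.Match) -> str:
        return "collect" if m.group(0).lower() in HARM_LEXICON else m.group(0)
    return re.sub(r"[A-Za-z]+", repl, text)

def solve_sum(problem: str) -> int:
    """Toy 'solver': sum all integers appearing in the problem text."""
    return sum(int(n) for n in re.findall(r"\d+", problem))

def safe_solve(problem: str) -> int:
    """Inference-time gate: sanitize the narrative before reasoning."""
    if is_harmful(problem):
        problem = neutralize(problem)
    return solve_sum(problem)

q = "Tom plans to steal 3 apples and then steal 4 more. How many apples?"
print(safe_solve(q))  # 7 -- same answer, but solved over a sanitized narrative
```

The point of the sketch is that the intervention touches only the surface narrative, so the mathematically well-defined task (and its answer) is preserved, mirroring the paper's claim that safety enforcement need not cost accuracy.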

Top-level tags: llm model evaluation natural language processing
Detailed tags: safety alignment mathematical reasoning harmful content dataset inference-time intervention

SafeMath: Inference-time Safety improves Math Accuracy


1️⃣ One-sentence summary

This paper finds that math problems framed as natural-language stories can carry hidden biased, unethical, or harmful content. The authors build a math dataset of such harmful scenarios (ToxicGSM) and propose a safety-alignment technique called SafeMath that reduces harmful outputs while maintaining, and in some cases improving, the model's mathematical problem-solving ability.

From arXiv: 2603.25201