arXiv submission date: 2026-01-24
📄 Abstract - Automatic Stability and Recovery for Neural Network Training

Training modern neural networks is increasingly fragile, with rare but severe destabilizing updates often causing irreversible divergence or silent performance degradation. Existing optimization methods primarily rely on preventive mechanisms embedded within the optimizer, offering limited ability to detect and recover from instability once it occurs. We introduce a supervisory runtime stability framework that treats optimization as a controlled stochastic process. By isolating an innovation signal derived from secondary measurements, such as validation probes, the framework enables automatic detection and recovery from destabilizing updates without modifying the underlying optimizer. We provide theoretical runtime safety guarantees that formalize bounded degradation and recovery. Our implementation incurs minimal overhead and is compatible with memory-constrained training settings.
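To make the detection-and-recovery loop concrete, here is a minimal sketch of what such a supervisory wrapper could look like, assuming PyTorch. The `StabilitySupervisor` class, the EMA-based z-score test on a validation probe, and all thresholds are illustrative assumptions for exposition, not the paper's actual algorithm.

```python
# Illustrative sketch only: class name, statistics, and thresholds are assumptions.
import copy
import torch


class StabilitySupervisor:
    """Watches a secondary measurement (a validation probe), flags suspect
    updates with a simple z-score test, and rolls the model back to the last
    known-good checkpoint. The underlying optimizer is never modified."""

    def __init__(self, model, probe_fn, threshold=4.0, ema_decay=0.9, warmup=20):
        self.model = model
        self.probe_fn = probe_fn      # callable: model -> scalar probe loss
        self.threshold = threshold    # z-score cutoff (illustrative)
        self.ema_decay = ema_decay
        self.warmup = warmup          # steps before detection is armed
        self.mean, self.var, self.count = 0.0, 0.0, 0
        # In-memory copy of the last accepted weights; a memory-constrained
        # implementation would keep this on disk or store a compressed delta.
        self.checkpoint = copy.deepcopy(model.state_dict())

    def _update_stats(self, x):
        # Exponential moving estimates of the probe's mean and variance.
        d = 1.0 - self.ema_decay
        diff = x - self.mean
        self.mean += d * diff
        self.var = self.ema_decay * self.var + d * diff * diff
        self.count += 1

    def step(self):
        """Call after each optimizer step. Returns False if a rollback occurred."""
        with torch.no_grad():
            probe_loss = float(self.probe_fn(self.model))
        # Innovation: deviation of the new probe value from its running estimate.
        z = abs(probe_loss - self.mean) / (self.var ** 0.5 + 1e-8)
        if self.count >= self.warmup and z > self.threshold:
            # Destabilizing update detected: restore the last good weights
            # and leave the running statistics uncontaminated.
            self.model.load_state_dict(self.checkpoint)
            return False
        # Healthy step: absorb the observation and refresh the checkpoint.
        self._update_stats(probe_loss)
        self.checkpoint = copy.deepcopy(self.model.state_dict())
        return True
```

In a training loop, one would call `supervisor.step()` right after `optimizer.step()`; a return value of `False` signals that the update was rejected and training resumes from the restored weights.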

Top-level tags: model training, machine learning systems
Detailed tags: training stability, optimization, runtime safety, neural networks, recovery mechanism

Automatic Stability and Recovery for Neural Network Training


1️⃣ One-sentence summary

This paper proposes a runtime framework that automatically monitors neural network training, detects destabilizing updates, and recovers from them without modifying the existing optimizer, thereby preserving the stability and safety of the training process.

Source: arXiv:2601.17483