The Quantization Trap: Breaking Linear Scaling Laws in Multi-Hop Reasoning
1️⃣ One-Sentence Summary
This paper finds that, for complex tasks requiring multi-hop reasoning, naively lowering a model's numerical precision (e.g., from 16-bit to 8-bit or 4-bit) does not save energy. Instead, hardware casting overhead and dequantization latency become the bottleneck, increasing total energy consumption and degrading reasoning accuracy, which breaks the widely held "lower precision means higher efficiency" linear scaling law.
Neural scaling laws provide a predictable recipe for AI advancement: reducing numerical precision should linearly improve computational efficiency and energy profile (E ∝ bits). In this paper, we demonstrate that this scaling law breaks down in the context of multi-hop reasoning. We reveal a 'quantization trap' in which reducing precision from 16-bit to 8-bit or 4-bit paradoxically increases net energy consumption while degrading reasoning accuracy. We provide a rigorous theoretical decomposition that attributes this failure to hardware casting overhead, to the hidden latency cost of dequantization kernels, which becomes a dominant bottleneck in sequential reasoning chains, and to a sequential energy amortization failure. As a result, the breakdown of the scaling law is unavoidable in practice. Our findings suggest that the industry's "smaller-is-better" heuristic is mathematically counterproductive for complex reasoning tasks.
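To make the abstract's argument concrete, the following is a minimal toy model, assuming a fixed per-hop casting and dequantization overhead that is only paid when running below full precision. The function names and constants here are illustrative placeholders, not the paper's notation or measured values; they merely show how a per-hop overhead that does not amortize across a sequential reasoning chain can make total energy at 8-bit or 4-bit exceed the 16-bit baseline.

```python
# Toy energy model contrasting the idealized linear scaling law
# (E proportional to bits) with a decomposition that charges a
# per-hop casting/dequantization overhead. All constants are
# illustrative placeholders, NOT measurements from the paper.

def ideal_energy(bits: int, hops: int, e_per_bit: float = 1.0) -> float:
    """Idealized law: energy per hop scales linearly with precision."""
    return e_per_bit * bits * hops

def decomposed_energy(bits: int, hops: int,
                      e_per_bit: float = 1.0,
                      e_cast: float = 6.0,      # assumed fixed casting overhead per hop
                      e_dequant: float = 8.0    # assumed dequantization kernel cost per hop
                      ) -> float:
    """Decomposition sketched in the abstract: compute energy plus
    per-hop casting and dequantization overhead that is only incurred
    below full (16-bit) precision and is paid again at every hop."""
    overhead = (e_cast + e_dequant) if bits < 16 else 0.0
    return (e_per_bit * bits + overhead) * hops

if __name__ == "__main__":
    # With these placeholder constants, 8-bit already costs more per hop
    # than 16-bit (8 + 14 > 16), and the gap grows with chain length.
    for hops in (1, 4, 8):
        for bits in (16, 8, 4):
            print(f"hops={hops:2d} bits={bits:2d} "
                  f"ideal={ideal_energy(bits, hops):7.1f} "
                  f"decomposed={decomposed_energy(bits, hops):7.1f}")
```

Under these assumed constants the decomposed model never amortizes the overhead, because every hop in the sequential chain pays it again; this is the qualitative shape of the "sequential energy amortization failure" the abstract names, not a quantitative reproduction of its results.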
Source: arXiv: 2602.13595