arXiv submission date: 2026-01-08
📄 Abstract - SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers

Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer's spectrum decays (how fast small modes vanish); higher values indicate stronger contraction into fewer effective directions. The transport score summarizes how much the token distribution shifts between adjacent layers using a bounded overlap measure; lower values indicate shorter, smoother steps through representation space. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass, while unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths. Overall, alignment is geometrically localized: the final layers encode the dominant preference-induced corrections. SPINAL turns this localization into a practical audit signal, quantifying where alignment concentrates, how strongly it manifests, and when it begins to destabilize during training.
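
As a rough illustration of the kind of depth trace the abstract describes, here is a minimal Python/PyTorch sketch. The concrete choices below are assumptions, not the paper's definitions: contraction is proxied by the fraction of spectral energy in the top-k singular directions of a layer's activations, and transport by one minus the mean Bhattacharyya overlap between adjacent layers' logit-lens token distributions. The function names (`contraction_score`, `transport_score`, `spinal_trace`) are hypothetical.

```python
import torch

def contraction_score(h, top_k=16):
    """Spectral-tail proxy for one layer's contraction.

    h: [tokens, d] activations of a single layer.
    Returns the fraction of spectral energy in the top_k singular
    directions; higher = stronger contraction into fewer effective
    directions. (Illustrative proxy, not the paper's exact formula.)
    """
    s = torch.linalg.svdvals(h - h.mean(dim=0, keepdim=True))
    energy = s.pow(2)
    return (energy[:top_k].sum() / energy.sum()).item()

def transport_score(h_prev, h_next, W_U):
    """Bounded proxy for the token-distribution shift between adjacent
    layers, read out through the unembedding ("logit lens").

    Returns 1 - mean Bhattacharyya coefficient over token positions,
    so the value lies in [0, 1]; lower = smaller, smoother step.
    """
    p = torch.softmax(h_prev @ W_U.T, dim=-1)
    q = torch.softmax(h_next @ W_U.T, dim=-1)
    bc = (p.sqrt() * q.sqrt()).sum(dim=-1)  # Bhattacharyya overlap per token
    return (1.0 - bc.mean()).item()

def spinal_trace(hidden_states, W_U):
    """Depth trace of (layer index, contraction score, transport score)."""
    trace = []
    for i in range(1, len(hidden_states)):
        trace.append((
            i,
            contraction_score(hidden_states[i]),
            transport_score(hidden_states[i - 1], hidden_states[i], W_U),
        ))
    return trace

# Tiny synthetic usage: 12 "layers" of 64 tokens with hidden size 128,
# and a random unembedding over a 1000-token vocabulary.
torch.manual_seed(0)
hs = [torch.randn(64, 128) for _ in range(12)]
W_U = torch.randn(1000, 128)
for layer, c, t in spinal_trace(hs, W_U):
    print(f"layer {layer:2d}  contraction={c:.3f}  transport={t:.3f}")
```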

Top-level tags: llm, model training, model evaluation
Detailed tags: direct preference optimization, representation geometry, layerwise analysis, alignment diagnostics, scaling laws

SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers


1️⃣ One-sentence summary

This paper introduces a diagnostic tool called SPINAL that traces layer-by-layer changes in a model's internal geometric structure. It shows that when Direct Preference Optimization (DPO) aligns a large language model, the core effect is concentrated in the model's final layers, making the output distribution more concentrated and stable, and thereby provides a quantifiable audit signal for the alignment process.
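
Building on the sketch above, the depth trace could be reduced to a simple audit statistic, for example how much contraction ramps up in the final third of layers relative to earlier depth, and whether transport declines across that tail. The split point and statistics below are illustrative assumptions, not the paper's audit criterion.

```python
def late_layer_ramp(trace, tail_frac=1 / 3):
    """Crude audit statistic over a (layer, contraction, transport) trace.

    Returns (ramp, transport_declines): the mean contraction in the last
    tail_frac of layers minus the mean over earlier layers, and whether
    transport is non-increasing across that tail. (Illustrative only.)
    """
    n = len(trace)
    split = int(n * (1 - tail_frac))
    early_c = sum(c for _, c, _ in trace[:split]) / split
    late_c = sum(c for _, c, _ in trace[split:]) / (n - split)
    tail_t = [t for _, _, t in trace[split:]]
    transport_declines = all(a >= b for a, b in zip(tail_t, tail_t[1:]))
    return late_c - early_c, transport_declines

# Example, reusing the synthetic trace from the earlier sketch:
# ramp, smooth_tail = late_layer_ramp(spinal_trace(hs, W_U))
```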

Source: arXiv: 2601.06238