arXiv submission date: 2026-04-13
📄 Abstract - Disposition Distillation at Small Scale: A Three-Arc Negative Result

We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).
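The probe-collapse result in the abstract (within-distribution AUC=0.683 falling to AUC=0.516 on fresh prompts) is a classic symptom of a linear probe latching onto a distribution-specific spurious direction. The sketch below illustrates the mechanism on synthetic data only; the dimensions, noise levels, and the least-squares probe are illustrative assumptions, not the paper's setup, and the `h_last` features here are random vectors rather than real model hidden states.

```python
import numpy as np

def auc(y, scores):
    """Rank-based AUC (equivalent to the Mann-Whitney U statistic)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
d = 64  # toy hidden-state dimension; real h_last is model-sized

# Within-distribution data: labels weakly correlated with one
# spurious direction that happens to exist in this prompt set.
w_spurious = rng.normal(size=d)
X_train = rng.normal(size=(400, d))
y_train = (X_train @ w_spurious + rng.normal(scale=12.0, size=400) > 0).astype(int)
X_val = rng.normal(size=(200, d))
y_val = (X_val @ w_spurious + rng.normal(scale=12.0, size=200) > 0).astype(int)

# "Fresh" prompts: the spurious correlation is absent, so the labels
# carry no information the probe can read from these features.
X_fresh = rng.normal(size=(200, d))
y_fresh = rng.integers(0, 2, size=200)

# Linear probe fit by least squares on +/-1 targets.
w_probe, *_ = np.linalg.lstsq(X_train, 2.0 * y_train - 1.0, rcond=None)

auc_val = auc(y_val, X_val @ w_probe)      # above chance (spurious signal)
auc_fresh = auc(y_fresh, X_fresh @ w_probe)  # near 0.5 (collapse to chance)
print(f"within-distribution AUC: {auc_val:.3f}")
print(f"fresh-prompt AUC:        {auc_fresh:.3f}")
```

The within-distribution score looks like a working detector, but it only measures how well the probe memorized a correlate of this prompt set; the fresh-prompt evaluation is what falsifies it, mirroring the paper's cross-validation check.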

Top-level tags: llm, model training, model evaluation
Detailed tags: behavioral distillation, negative result, small language models, attention intervention, confidence calibration

Disposition Distillation at Small Scale: A Three-Arc Negative Result


1️⃣ One-sentence summary

Through a series of rigorous experiments, this paper finds that multiple methods for "distilling" behavioral dispositions such as self-verification and uncertainty acknowledgment into small language models all fail: they either damage the model's content quality or merely teach the model to mimic a style, without genuinely improving the underlying dispositions.

Source: arXiv:2604.11867