📄 Abstract — Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks when fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion: skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers, which overwrite low-level signals and destabilize training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and remains usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves performance on OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
1️⃣ One-Sentence Summary
This paper identifies and resolves a key training problem that multimodal large models face on OCR tasks: gradient interference in conventional feature-fusion methods corrupts low-level visual details. It proposes a simple and effective fix that blocks gradient propagation through skip connections during training, and designs a diagnostic tool to verify whether the model preserves fine-grained visual information, yielding significant gains on OCR and related multimodal tasks.
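The core mechanism is easy to picture in code: shallow features still contribute to the fused representation in the forward pass, but `detach()` prevents the semantic loss from back-propagating through the skip branch. Below is a minimal PyTorch sketch under assumptions not stated in the abstract — the module name `DetachedSkipFusion` and the concat-then-project fusion are illustrative choices, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DetachedSkipFusion(nn.Module):
    """Illustrative sketch: fuse a shallow ViT feature with a deep one,
    while stopping gradients through the skip (shallow) branch."""

    def __init__(self, dim: int):
        super().__init__()
        # Fuse by concatenation followed by a linear projection
        # (one possible fusion; the paper's scheme may differ).
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # detach(): shallow features are reused in the forward pass,
        # but the high-level objective cannot push gradients back
        # through this branch into early visual layers.
        fused = torch.cat([shallow.detach(), deep], dim=-1)
        return self.proj(fused)

# Quick check of the asymmetric gradient flow:
dim = 4
fusion = DetachedSkipFusion(dim)
shallow = torch.randn(2, 3, dim, requires_grad=True)
deep = torch.randn(2, 3, dim, requires_grad=True)
fusion(shallow, deep).sum().backward()
print(shallow.grad is None)  # skip branch blocked
print(deep.grad is not None)  # main branch still trains
```

Note the asymmetry: in a full model the shallow layers still receive gradients through the main (deep) pathway, so they keep training — only the direct shortcut from the semantic loss is severed.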