Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
1️⃣ One-Sentence Summary
This paper proposes a method called UncL-STARK that lets transformer-based visual trackers dynamically adjust their computational depth according to frame complexity during video processing, significantly reducing computation, latency, and energy consumption while leaving tracking accuracy almost unchanged.
Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder-decoder stack for every frame regardless of visual complexity and thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned with random-depth training and knowledge distillation so that it remains predictive and robust at multiple intermediate depths, enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and feed it into a feedback-driven policy that, exploiting the temporal coherence of video, selects the encoder and decoder depths for the next frame based on the current prediction confidence. Extensive experiments on GOT-10k and LaSOT demonstrate up to a 12% reduction in GFLOPs, an 8.9% reduction in latency, and 10.8% energy savings, while tracking accuracy stays within 0.2% of the full-depth baseline across both short-term and long-term sequences.
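To make the runtime policy concrete, here is a minimal sketch in PyTorch of how an uncertainty score could be read off the corner localization heatmaps and mapped to a layer budget for the next frame. The function names (`heatmap_uncertainty`, `select_depth`), the entropy-based uncertainty measure, and the thresholds and depth values are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def heatmap_uncertainty(corner_heatmaps: torch.Tensor) -> float:
    """Estimate prediction uncertainty from corner localization heatmaps.

    corner_heatmaps: tensor of shape (2, H, W) holding the top-left and
    bottom-right corner heatmaps from the tracker's box head. A sharply
    peaked heatmap has low entropy, which we read as low uncertainty.
    (Entropy is an assumed proxy; the paper may use a different statistic.)
    """
    probs = corner_heatmaps.flatten(1).softmax(dim=-1)           # (2, H*W)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)    # (2,)
    max_entropy = torch.log(torch.tensor(float(probs.shape[-1])))
    return (entropy / max_entropy).mean().item()                 # normalized to [0, 1]

def select_depth(uncertainty: float,
                 current_depth: int,
                 max_depth: int = 6,
                 min_depth: int = 2,
                 low_thr: float = 0.2,
                 high_thr: float = 0.5) -> int:
    """Feedback policy: choose the encoder/decoder depth for the NEXT frame.

    Low uncertainty on the current frame suggests an easy, temporally
    coherent scene, so the next frame can be truncated further; high
    uncertainty falls back to the full stack. Thresholds are hypothetical.
    """
    if uncertainty > high_thr:
        return max_depth                           # hard frame: use full depth
    if uncertainty < low_thr:
        return max(min_depth, current_depth - 1)   # easy frame: truncate further
    return current_depth                           # otherwise keep the current depth
```

In a tracking loop, `select_depth` would be called once per frame after the box head produces its heatmaps, and the returned value would cap how many encoder and decoder layers run on the following frame.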
Source: arXiv 2602.16160