DRDN: Decoupled Representation Dynamic Network for From-Scratch ViT Class-Incremental Learning
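1️⃣ One-sentence summary
This paper proposes DRDN, a method built on a two-part decoupling strategy: masked image modeling preserves general-purpose visual features in the backbone, while hierarchical task token expansion reduces conflicts between new and old tasks. Together these improve the class-incremental learning performance of Vision Transformers trained without external pretraining.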
Dynamic expansion methods for class-incremental learning (CIL) protect task-specific knowledge by growing dedicated tokens or subnetworks, yet our analyses suggest that classification supervision alone does not sufficiently preserve task-agnostic shared backbone representations over long incremental sequences. We identify two intertwined challenges: cross-task confusion from sequential training on predominantly current-task data, which biases decision boundaries toward recent tasks; and under-optimized shared representations in the backbone that cap long-term discriminability as tasks accumulate. We propose the Decoupled Representation Dynamic Network (DRDN), which addresses these challenges via two orthogonal mechanisms. For shared backbone representations, DRDN continuously applies masked image modeling (MIM) at every incremental step, with reconstruction gradients routed exclusively through the backbone, encouraging it to retain general visual structure beyond class-discriminative cues. For task-specific discrimination, DRDN employs hierarchical task token expansion across all transformer layers, with a modified per-task attention rule that reduces inter-task interference. We support this design with accuracy degradation analysis and cross-task confusion rate measurements. In the from-scratch ViT CIL setting (no external pretraining), DRDN consistently improves over strong token-expansion baselines with comparable backbone scale. On CIFAR100-B0 (10 steps), DRDN achieves 77.19% average accuracy, outperforming DKT by 1.36 points and DyTox by 3.53 points, with an advantage that grows on longer incremental sequences. Multi-seed validation confirms stability (±0.31%). The MIM decoder is active only during training, adding no inference-time parameters or computation.
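The abstract reports an average accuracy and a cross-task confusion rate without defining either. The sketch below shows one plausible way to compute them. It assumes that classes are split into equal-sized, contiguous tasks, that a sample counts as cross-task confused when its predicted class belongs to a different task than its true class, and that average accuracy is the mean of the accuracies measured after each incremental step. The function names and these definitions are illustrative assumptions, not the paper's own.

```python
from typing import Sequence


def task_of(label: int, classes_per_task: int) -> int:
    """Map a class label to a task index (assumes equal-sized, contiguous tasks)."""
    return label // classes_per_task


def cross_task_confusion_rate(
    y_true: Sequence[int], y_pred: Sequence[int], classes_per_task: int
) -> float:
    """Fraction of samples whose predicted class lies in a different task.

    Assumed definition: errors within the correct task are not counted.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    confused = sum(
        task_of(t, classes_per_task) != task_of(p, classes_per_task)
        for t, p in zip(y_true, y_pred)
    )
    return confused / len(y_true)


def average_incremental_accuracy(step_accuracies: Sequence[float]) -> float:
    """Mean of the accuracies measured after each incremental step."""
    if len(step_accuracies) == 0:
        raise ValueError("step_accuracies must be non-empty")
    return sum(step_accuracies) / len(step_accuracies)


# Toy data: three tasks of 10 classes each (hypothetical labels).
y_true = [0, 1, 10, 11, 20, 21]
y_pred = [0, 12, 10, 11, 3, 22]
rate = cross_task_confusion_rate(y_true, y_pred, classes_per_task=10)
avg = average_incremental_accuracy([90.0, 80.0, 70.0])
```

In the toy data, two of the six predictions land in the wrong task, so `rate` is 1/3; the last prediction (22 for true class 21) is wrong but stays within the correct task, so under this assumed definition it does not count as cross-task confusion.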
Source: arXiv: 2607.01630