
arXiv submission date: 2026-03-04
📄 Abstract - Scaling Dense Event-Stream Pretraining from Visual Foundation Models

Learning versatile, fine-grained representations from irregular event streams is pivotal yet nontrivial, primarily due to the heavy annotation that hinders scalability in dataset size, semantic richness, and application scope. To mitigate this dilemma, we launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. Specifically, we curate an extensive synchronized image-event collection to amplify cross-modal alignment. Nevertheless, due to inherent mismatches in sparsity and granularity between image-event domains, existing distillation paradigms are prone to semantic collapse in event representations, particularly at high resolutions. To bridge this gap, we propose to extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision. The key ingredient of our method is a structure-aware distillation loss that grounds higher-quality image-event correspondences for alignment, optimizing dense event representations. Extensive experiments demonstrate that our approach takes a great leap in downstream benchmarks, significantly surpassing traditional methods and existing pretraining techniques. This breakthrough manifests in enhanced generalization, superior data efficiency and elevated transferability.
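The core idea in the abstract, aligning the relational *structure* of VFM features with event features rather than matching individual features one-to-one, can be sketched in a minimal, hypothetical form. The paper's actual loss is not specified here; the pairwise cosine-similarity matrices and the MSE objective below are illustrative assumptions:

```python
import numpy as np

def pairwise_cosine(feats):
    """Pairwise cosine-similarity ("structure") matrix of row-vector features."""
    unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return unit @ unit.T

def structure_distill_loss(teacher_feats, student_feats):
    """Match relational structure: MSE between the two similarity matrices,
    instead of forcing per-feature alignment across mismatched modalities."""
    s_teacher = pairwise_cosine(teacher_feats)
    s_student = pairwise_cosine(student_feats)
    return float(np.mean((s_teacher - s_student) ** 2))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((8, 32))                     # e.g. VFM patch features
student = teacher + 0.05 * rng.standard_normal((8, 32))    # e.g. event-branch features
print(structure_distill_loss(teacher, student))            # small: structures agree
```

A structure-level objective like this tolerates the sparsity and granularity mismatch between images and events, since only the pattern of pairwise similarities must agree, not the raw feature values.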

Top-level tags: computer vision · model training · multi-modal
Detailed tags: event streams · self-supervised learning · knowledge distillation · representation learning · cross-modal alignment

Scaling Dense Event-Stream Pretraining from Visual Foundation Models


1️⃣ One-Sentence Summary

This paper proposes a new self-supervised pretraining method that distills semantic-structure information from visual foundation models to guide learning on event-stream data, substantially improving the quality of event representations and their performance on downstream tasks.
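As background for the summary above: event streams are irregular sets of (x, y, t, polarity) tuples, and a common way to feed them to a dense network is to accumulate them into a voxel grid. This is a generic illustration of that preprocessing step, not the representation used in the paper:

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, H, W, bins):
    """Scatter sparse (x, y, t, polarity) events into a dense
    (bins, H, W) tensor by binning timestamps along the time axis."""
    grid = np.zeros((bins, H, W), dtype=np.float32)
    t0, t1 = ts.min(), ts.max()
    # Map each timestamp to a temporal bin index in [0, bins - 1].
    b = ((ts - t0) / max(t1 - t0, 1e-9) * (bins - 1)).round().astype(int)
    np.add.at(grid, (b, ys, xs), ps)  # scatter-add signed polarities
    return grid

# Toy stream: 4 events on a 4x4 sensor, split into 2 temporal bins.
xs = np.array([0, 1, 1, 3]); ys = np.array([0, 0, 2, 3])
ts = np.array([0.0, 0.1, 0.8, 1.0]); ps = np.array([1, -1, 1, 1], dtype=np.float32)
grid = events_to_voxel_grid(xs, ys, ts, ps, H=4, W=4, bins=2)
print(grid.shape)  # (2, 4, 4)
print(grid.sum())  # 2.0 (signed sum of polarities)
```

A dense grid like this is what makes patch-wise alignment against image features from a VFM possible in the first place.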

Source: arXiv:2603.03969