
arXiv submission date: 2026-04-09
📄 Abstract - Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
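The paper does not release implementation details in this abstract, but the TFE idea — a fixed number of learnable tokens that cross-attend over the full spatiotemporal feature sequence, so cost scales with the token count rather than frames × patches — can be sketched as follows. This is a hypothetical illustration, not the authors' code; the module name `TokenFeatureExtractor`, the dimensions, and the mean-pooling readout are all assumptions.

```python
import torch
import torch.nn as nn

class TokenFeatureExtractor(nn.Module):
    """Hypothetical sketch of TFE-style aggregation: a fixed set of
    learnable query tokens cross-attends over per-frame patch features,
    so attention cost grows with num_tokens, not the T*P sequence length."""

    def __init__(self, dim: int = 512, num_tokens: int = 8, num_heads: int = 8):
        super().__init__()
        # Fixed-length learnable tokens acting as cross-attention queries
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T*P, dim) spatiotemporal features, e.g. from a CLIP
        # image encoder applied per frame (T frames, P patches each)
        batch = frame_feats.size(0)
        queries = self.tokens.unsqueeze(0).expand(batch, -1, -1)  # (B, num_tokens, dim)
        out, _ = self.attn(queries, frame_feats, frame_feats)     # cross-attention
        return self.norm(out).mean(dim=1)                         # (B, dim) identity embedding

# Usage: 16 frames x 49 patches of 512-d features for a batch of 2 clips
tfe = TokenFeatureExtractor()
video = torch.randn(2, 16 * 49, 512)
emb = tfe(video)
print(emb.shape)  # torch.Size([2, 512])
```

Because the queries are only `num_tokens` long, the attention matrix is `num_tokens × (T·P)` rather than `(T·P) × (T·P)`, which is the efficiency argument the abstract makes for TFE.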

Top-level tags: computer vision, multi-modal, model training
Detailed tags: person re-identification, video retrieval, clip framework, caption guidance, spatiotemporal features

Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification


1️⃣ One-Sentence Summary

This paper proposes CG-CLIP, a method that uses textual descriptions and learnable tokens to tackle video-based person re-identification in high-difficulty scenarios such as sports and dance, where similar clothing and complex motion make matching hard, and it achieves state-of-the-art performance across multiple datasets.

Source: arXiv 2604.07740