Human-AI Divergence in Ego-centric Action Recognition under Spatial and Spatiotemporal Manipulations
1️⃣ One-Sentence Summary
This study finds that when recognizing actions in video, humans rely primarily on semantically critical cues such as hand-object interactions, whereas AI models depend more on context and on mid- to low-level visual features; as a result, when frames are spatially cropped or the temporal order is scrambled, the two diverge markedly in both recognition performance and strategy.
Humans consistently outperform state-of-the-art AI models in action recognition, particularly under challenging real-world conditions involving low resolution, occlusion, and visual clutter. Understanding the sources of this performance gap is essential for developing more robust and human-aligned models. In this paper, we present a large-scale human-AI comparative study of egocentric action recognition using Minimal Identifiable Recognition Crops (MIRCs), defined as the smallest spatial or spatiotemporal regions sufficient for reliable human recognition. We use our previously introduced Epic ReduAct dataset, a systematically spatially reduced and temporally scrambled dataset derived from 36 EPIC-KITCHENS videos, spanning multiple spatial reduction levels and temporal conditions. Recognition performance is evaluated with over 3,000 human participants and the Side4Video model. Our analysis combines quantitative metrics, Average Reduction Rate and Recognition Gap, with qualitative analyses of spatial (high-, mid-, and low-level visual features) and spatiotemporal factors, including a categorisation of actions into Low Temporal Actions (LTA) and High Temporal Actions (HTA). Results show that human performance declines sharply when transitioning from MIRCs to subMIRCs, reflecting a strong reliance on sparse, semantically critical cues such as hand-object interactions. In contrast, the model degrades more gradually, often relying on contextual and mid- to low-level features and sometimes even exhibiting increased confidence under spatial reduction. Temporally, humans remain robust to scrambling as long as key spatial cues are preserved, whereas the model is often insensitive to temporal disruption, revealing class-dependent temporal sensitivities.
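The two quantitative metrics named above can be sketched in code. This is a minimal illustration under assumed definitions, since the abstract does not spell them out: Recognition Gap is taken here as the accuracy drop from a MIRC to its subMIRC (in the spirit of the original MIRC literature), and Average Reduction Rate as the mean fraction of the original frame area removed to reach each MIRC. The function names and definitions are this sketch's assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the two metrics from the abstract, under assumed
# definitions (not taken verbatim from the paper).

def recognition_gap(mirc_acc: float, submirc_acc: float) -> float:
    """Accuracy drop when moving from a MIRC to its subMIRC (assumed)."""
    return mirc_acc - submirc_acc

def average_reduction_rate(original_areas, mirc_areas):
    """Mean fraction of spatial area removed to reach each MIRC (assumed)."""
    rates = [1.0 - mirc / orig for orig, mirc in zip(original_areas, mirc_areas)]
    return sum(rates) / len(rates)

# Illustrative (made-up) numbers: humans show a sharp MIRC-to-subMIRC
# drop, while the model degrades more gradually.
human_gap = recognition_gap(0.90, 0.20)  # large gap: sharp decline
model_gap = recognition_gap(0.55, 0.45)  # small gap: gradual decline
print(human_gap > model_gap)
```

Under these assumed definitions, the paper's qualitative finding corresponds to a large human Recognition Gap and a small model one at the same reduction level.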
Source: arXiv: 2603.08317