arXiv submission date: 2026-06-11
📄 Abstract - OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet it remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when explicit temporal modeling is added through Graph Neural Networks. We therefore introduce a vision-only temporal model that significantly outperforms graph-based methods when using all available egocentric video as input. Building on this model, we also introduce a novel multi-view to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.
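To make the step from frame-wise relations to temporally extended actions concrete, here is a minimal sketch. It assumes each frame is given as a set of (subject, predicate, object) triplets and simply merges consecutive frames in which a triplet stays active into one segment. The function name `frames_to_segments`, the `Segment` type, and the triplet labels are hypothetical illustrations and are not taken from the paper, whose distillation procedure and action taxonomy are not described in this abstract.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

# Hypothetical representation: one relation is a (subject, predicate, object) triplet.
Triplet = Tuple[str, str, str]


class Segment(NamedTuple):
    triplet: Triplet
    start: int  # first frame index in which the triplet is active (inclusive)
    end: int    # last frame index in which the triplet is active (inclusive)


def frames_to_segments(frames: List[Set[Triplet]]) -> List[Segment]:
    """Merge runs of consecutive frames with the same active triplet into segments."""
    segments: List[Segment] = []
    active: Dict[Triplet, int] = {}  # triplet -> frame index where its current run began

    for t, triplets in enumerate(frames):
        # A triplet that was active but is missing now: its run ended at frame t - 1.
        for trip in list(active):
            if trip not in triplets:
                segments.append(Segment(trip, active.pop(trip), t - 1))
        # A triplet that appears now without an open run: start a new run at frame t.
        for trip in triplets:
            active.setdefault(trip, t)

    # Close every run that is still open at the end of the video.
    last = len(frames) - 1
    for trip, start in active.items():
        segments.append(Segment(trip, start, last))

    return sorted(segments, key=lambda s: (s.start, s.end, s.triplet))


if __name__ == "__main__":
    # Illustrative labels only, not the benchmark's taxonomy.
    hold = ("nurse", "holding", "scalpel")
    cut = ("surgeon", "cutting", "patient")
    for seg in frames_to_segments([{hold}, {hold, cut}, {cut}, set(), {hold}]):
        print(seg)
```

A rule of this kind only groups identical per-frame states, so a single misclassified frame splits a segment in two. That sensitivity is one plausible reason why frame-wise predictions are hard to turn into clean action segments without explicit temporal modeling.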

Top-level tags: computer vision, benchmark
Detailed tags: action recognition, scene graph, operating room, multi-view alignment, temporal reasoning

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions


1️⃣ One-sentence summary

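This paper proposes a fine-grained, multi-role action recognition method for operating room video. By building the first action-centric benchmark dataset and a vision-only temporal model, it significantly improves action understanding under heavy occlusion and limited viewpoints, and it introduces a multi-view to single-view feature alignment technique that reduces reliance on multi-camera data.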

Source: arXiv: 2606.13332