SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
1️⃣ One-Sentence Summary
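This paper proposes SurgOnAir, a streaming vision-language model that, like a live commentator, analyzes surgical video frame by frame in real time and simultaneously generates multi-level textual descriptions spanning actions, steps, and phases, so that AI systems can perceive and respond immediately to subtle changes and key transitions during surgery.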
Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.
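The streaming behaviour described in the abstract can be pictured with a toy loop. The sketch below is not the authors' implementation: it assumes that per-frame labels for the three hierarchy levels are already available (in the real system they would come from the vision-language model), and the names `FrameState`, `stream_narration`, and the `<..._transition>` marker strings are hypothetical, since the abstract does not specify the actual token vocabulary. It only illustrates two ideas: tokens are produced frame by frame using past frames only, and a special marker is emitted when the state at some level changes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

# Hierarchy levels named in the abstract, ordered coarse to fine.
LEVELS = ("phase", "step", "action")


@dataclass(frozen=True)
class FrameState:
    """Hypothetical per-frame labels; stands in for model predictions."""
    phase: str
    step: str
    action: str


def stream_narration(states: Iterable[FrameState]) -> Iterator[List[str]]:
    """Yield one list of narration tokens per incoming frame.

    Only the previous frame is consulted, so no future frames are needed.
    A hypothetical transition marker precedes every label that changed.
    """
    previous: Optional[FrameState] = None
    for state in states:
        tokens: List[str] = []
        for level in LEVELS:
            current = getattr(state, level)
            if previous is None or getattr(previous, level) != current:
                tokens.append(f"<{level}_transition>")
                tokens.append(current)
        yield tokens
        previous = state


if __name__ == "__main__":
    # Made-up labels purely for illustration.
    frames = [
        FrameState("dissection", "clip", "grasp"),
        FrameState("dissection", "clip", "cut"),
        FrameState("dissection", "clip", "cut"),
    ]
    for index, tokens in enumerate(stream_narration(frames)):
        print(index, tokens)
```

In this toy version a frame with no change at any level yields an empty token list; how the actual model narrates steady-state frames is not stated in the abstract.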
Source: arXiv: 2605.21132