Real-Time Multimodal Activity-Aware Error Detection in Robot-Assisted Surgery
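1️⃣ One-Sentence Summary
This paper proposes a unified framework that combines video, kinematic data, and textual descriptions. Using activity prompting and activity-aware visual embeddings, it significantly improves technical error detection in robot-assisted surgery, raising F1 scores by up to 5% and 16.6% on two public datasets.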
Robot-assisted minimally invasive surgery improves surgical precision but introduces complexity, making technical error detection essential for ensuring patient safety. Current executional error detection methods using video data often overlook fine-grained contextual descriptions of activities and error types within the hierarchical structure of surgical procedures. They also under-utilize complementary multimodal information. We propose a unified framework for executional error detection that leverages multimodal input, including video, kinematics, and descriptive textual prompts. Through activity prompting, we integrate descriptive language in gesture-level activities, instrument-object interactions, and error definitions. We also introduce activity-aware visual embeddings derived from vision encoders pretrained on surgical activity labels to compare the effectiveness of contrastive language-image embeddings with traditional image-based embeddings for error detection. By seamlessly integrating kinematic data with video and textual modalities, our framework significantly improves error detection performance. Achieving up to 5% and 16.6% F1 score improvements over state-of-the-art baselines on the JIGSAWS and SAR-RARP50 datasets, respectively, we demonstrate the value of combining curated textual prompts with multimodal data for accurate error detection.
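The reported gains are in F1 score, the harmonic mean of precision and recall. As a minimal sketch of how that metric is computed for a binary error detector, the snippet below assumes a label of 1 for an erroneous segment and 0 for a normal one. The function name `f1_score` and the label lists are hypothetical, made up purely for illustration. The abstract does not say whether the paper averages F1 per class or per gesture, so this should not be read as the authors' evaluation protocol.

```python
def f1_score(y_true, y_pred):
    """Binary F1, assuming 1 = erroneous segment and 0 = normal segment."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # errors correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed errors
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight segments (not data from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 penalizes both missed errors and false alarms, it is a stricter summary than plain accuracy when erroneous segments are rare.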
Source: arXiv: 2606.23593