arXiv submission date: 2026-04-22
📄 Abstract - SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark

Fine-grained spatiotemporal reasoning on surgical videos is critical, yet the capabilities of Multi-modal Large Language Models (MLLMs) in this domain remain largely unexplored. To bridge this gap, we introduce SurgCoT, a unified benchmark for evaluating chain-of-thought (CoT) reasoning in MLLMs across 7 surgical specialties and 35 diverse procedures. SurgCoT assesses five core reasoning dimensions: Causal Action Ordering, Cue-Action Alignment, Affordance Mapping, Micro-Transition Localization, and Anomaly Onset Tracking, through a structured CoT framework with an intensive annotation protocol (Question-Option-Knowledge-Clue-Answer), where the Knowledge field provides essential background context and Clue provides definitive spatiotemporal evidence. Evaluation of 10 leading MLLMs shows: 1) commercial models outperform open-source and medical-specialized variants; 2) significant gaps exist in surgical CoT reasoning; 3) SurgCoT enables effective evaluation and enhances progressive spatiotemporal reasoning. SurgCoT provides a reproducible testbed to narrow the gap between MLLM capabilities and clinical reasoning demands. Code: this https URL.
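The abstract describes a Question-Option-Knowledge-Clue-Answer (QOKCA) annotation protocol, where Knowledge carries background context and Clue carries definitive spatiotemporal evidence. A minimal sketch of what one such record might look like, assuming a simple dataclass schema (all field names and the example content are illustrative, not the dataset's actual format):

```python
from dataclasses import dataclass

# Hypothetical sketch of one SurgCoT annotation record following the
# Question-Option-Knowledge-Clue-Answer protocol from the abstract.
# Field names and values are illustrative assumptions, not the real schema.
@dataclass
class SurgCoTItem:
    question: str        # the reasoning question about the video clip
    options: list[str]   # candidate answers (multiple choice)
    knowledge: str       # essential background surgical context
    clue: str            # definitive spatiotemporal evidence (timestamps, regions)
    answer: str          # the correct option
    dimension: str       # one of the five reasoning dimensions

item = SurgCoTItem(
    question="Which action must precede vessel ligation in this clip?",
    options=["A. Dissection", "B. Irrigation", "C. Suturing", "D. Retraction"],
    knowledge="Ligation requires the vessel to be exposed by dissection first.",
    clue="Dissection visible at 00:12-00:27; clip applier enters at 00:31.",
    answer="A",
    dimension="Causal Action Ordering",
)
print(item.dimension)  # -> Causal Action Ordering
```

Separating Knowledge from Clue this way lets an evaluator check whether a model's chain of thought actually grounds its answer in the cited spatiotemporal evidence rather than in background priors alone.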

Top-level tags: medical, multi-modal, benchmark
Detailed tags: surgical video, spatiotemporal reasoning, chain-of-thought, evaluation, multi-modal LLM

SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark


1️⃣ One-sentence summary

This paper introduces SurgCoT, a benchmark dataset built to evaluate the spatiotemporal reasoning of multi-modal large language models on surgical videos. Through a structured chain-of-thought framework and fine-grained annotations, it reveals significant shortcomings of current models on key dimensions such as causal reasoning and action alignment.

Source: arXiv 2604.20319