Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?
1️⃣ One-Sentence Summary
This paper proposes a new method called SurgLIME, which trains a vision-language model on surgical video descriptions automatically generated by a large language model (instead of expensive human annotations), and uses novel techniques to effectively filter erroneous information from the text, reducing annotation cost while preserving the model's ability to understand and reason about surgical videos.
Recent advancements in self-supervised learning have led to powerful surgical vision encoders capable of spatiotemporal understanding. However, extending these visual foundations to multi-modal reasoning tasks is severely bottlenecked by the prohibitive cost of expert textual annotations. To overcome this scalability limitation, we introduce \textbf{LIME}, a large-scale multi-modal dataset derived from open-access surgical videos using human-free, Large Language Model (LLM)-generated narratives. While LIME offers immense scalability, unverified generated texts may contain errors, including hallucinations, that could potentially lead to catastrophically degraded pre-trained medical priors in standard contrastive pipelines. To mitigate this, we propose \textbf{SurgLIME}, a parameter-efficient Vision-Language Pre-training (VLP) framework designed to learn reliable cross-modal alignments using noisy narratives. SurgLIME preserves foundational medical priors using a LoRA-adapted dual-encoder architecture and introduces an automated confidence estimation mechanism that dynamically down-weights uncertain text during contrastive alignment. Evaluations on the AutoLaparo and Cholec80 benchmarks show that SurgLIME achieves competitive zero-shot cross-modal alignment while preserving the robust linear probing performance of the visual foundation model. Dataset, code, and models are publicly available at \href{this https URL}{this https URL}.
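The paper does not spell out the form of its confidence-weighting mechanism, but the idea of dynamically down-weighting uncertain text during contrastive alignment can be illustrated with a minimal sketch. Below is a hypothetical confidence-weighted InfoNCE-style loss in NumPy; the function name, the scalar-per-pair weighting scheme, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_contrastive_loss(img_emb, txt_emb, confidence, temperature=0.07):
    """Illustrative confidence-weighted contrastive (InfoNCE-style) loss.

    Each image-text pair's cross-entropy term is scaled by a per-pair
    confidence weight, so low-confidence (possibly hallucinated) LLM
    captions contribute less to the cross-modal alignment objective.
    This is a sketch of the general technique, not SurgLIME's exact loss.
    """
    # L2-normalize embeddings so the dot product is a cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (B, B) similarity matrix

    # Row-wise softmax over text candidates for each image
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    # Cross-entropy of the matching (diagonal) pair, down-weighted by
    # the normalized per-caption confidence scores
    per_pair = -np.log(np.diag(probs) + 1e-12)
    w = confidence / (confidence.sum() + 1e-12)
    return float((w * per_pair).sum())

rng = np.random.default_rng(0)
B, D = 4, 8
img = rng.normal(size=(B, D))
txt = img + 0.1 * rng.normal(size=(B, D))           # roughly aligned pairs
conf = np.array([1.0, 1.0, 0.2, 1.0])               # third caption is suspect
loss = weighted_contrastive_loss(img, txt, conf)
```

In a real pipeline the confidence scores would come from an automated estimator over the generated narratives; here they are fixed by hand purely to show how a suspect caption's gradient contribution shrinks.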
Source: arXiv:2604.18134