Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection
1️⃣ One-sentence summary
By injecting physical knowledge such as object properties and motion laws into multi-turn dialogues, this work substantially improves the ability of general-purpose vision-language models to detect dynamic anomalies that violate physical laws (e.g., irregular rotation), far surpassing the previous state of the art.
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection, substantially outperforming the prior SOTA (66.9%), and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
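The multi-turn delivery of physical priors described above can be sketched as follows. This is a hypothetical illustration, not the paper's actual prompts or code: the function name, prior texts, and chat-message format are assumptions chosen to show how priors might be fed to a VLM one turn at a time so causal reasoning is decomposed into incremental steps.

```python
# Hypothetical sketch of multi-turn physics-informed prompting.
# Each physical prior (object property, motion paradigm, dynamic
# constraint) is delivered in its own dialogue turn before the final
# anomaly-detection question. All names and texts are illustrative.

def build_physics_dialogue(object_name, priors, question):
    """Assemble a chat-style message list that injects physical
    priors incrementally before asking about the anomaly."""
    messages = [{
        "role": "system",
        "content": (f"You are inspecting a video of a {object_name}. "
                    "Reason about its physical dynamics step by step."),
    }]
    for prior in priors:  # one incremental reasoning step per prior
        messages.append({"role": "user",
                         "content": f"Physical prior: {prior}"})
        messages.append({"role": "assistant",
                         "content": "Understood; I will apply this constraint."})
    messages.append({"role": "user", "content": question})
    return messages

dialogue = build_physics_dialogue(
    "fan",
    priors=[
        "Rigid blades rotate at a steady angular velocity when normal.",
        "Abrupt speed changes violate rotational inertia.",
    ],
    question="Does the observed motion violate any stated constraint?",
)
```

Structuring the conversation this way mirrors the abstract's claim that decomposing causal reasoning into steps, rather than packing all priors into one prompt, is what enables robust representations of normal versus abnormal dynamics.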
Source: arXiv: 2603.15237