GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection
1️⃣ One-sentence summary
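This paper proposes GDCNet, a method that uses objective image descriptions generated by a multimodal large language model as a stable reference for comparing the semantic and sentiment differences between an image and its accompanying text, so that sarcasm in image-text content is detected more accurately and more robustly.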
Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
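The discrepancy-then-gate idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the encoders, the discrepancy measure, or the gate's form. The sketch assumes embeddings are plain lists of floats, uses one minus cosine similarity as a stand-in discrepancy, and uses a single scalar sigmoid gate. All function names and parameters (`discrepancy`, `gated_fusion`, `gate_weights`, `gate_bias`) are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def discrepancy(caption_vec, text_vec):
    """Assumed discrepancy measure: 1 - cosine similarity.

    0 when the generated caption and the original text point the same way,
    2 when they point in opposite directions.
    """
    return 1.0 - cosine_similarity(caption_vec, text_vec)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(visual, textual, disc_features, gate_weights, gate_bias):
    """Hypothetical gate: a scalar g in (0, 1) computed from the discrepancy
    features blends the visual and textual vectors, and the discrepancy
    features are appended to the blended vector.
    """
    z = sum(w * d for w, d in zip(gate_weights, disc_features)) + gate_bias
    g = sigmoid(z)
    blended = [g * v + (1.0 - g) * t for v, t in zip(visual, textual)]
    return blended + list(disc_features)


# Toy usage with made-up 2-d embeddings (illustrative values only).
semantic_gap = discrepancy([1.0, 0.0], [0.0, 1.0])    # caption vs. text meaning
sentiment_gap = discrepancy([1.0, 0.0], [-1.0, 0.0])  # caption vs. text sentiment
fused = gated_fusion(
    visual=[1.0, 0.0],
    textual=[0.0, 1.0],
    disc_features=[semantic_gap, sentiment_gap],
    gate_weights=[0.0, 0.0],  # zero weights give g = 0.5, an even blend
    gate_bias=0.0,
)
```

In a trained model the gate weights would be learned, so large discrepancies could shift the balance toward whichever modality is more informative. The fixed zero weights here only show the mechanics.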
Source: arXiv: 2601.20618