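Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning
1️⃣ One-sentence summary
This paper proposes a new image captioning framework that hierarchically retrieves over article structure (such as headlines, body text, and image placement) and fuses visual and textual information, helping the model supply background knowledge that cannot be seen in a news image and thereby produce richer, more context-aware captions.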
Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond treating articles as monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM uses both the description and the extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. The source code is publicly released.
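The structure-aware retrieval scoring described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Article` fields, the weight values, the rule that images placed nearer the top of an article score higher, and the assumption that query images, article images, and text are embedded in one shared vector space are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Article:
    # Hypothetical structure: embeddings are assumed to be precomputed
    # in a shared image-text space.
    article_id: str
    headline_emb: Vector
    body_emb: Vector
    image_embs: List[Vector]
    image_positions: List[float]  # 0.0 = top of article, 1.0 = bottom


# Illustrative weights only; the abstract does not state actual values.
HEADLINE_W, BODY_W = 0.6, 0.4
CONTENT_W, VISUAL_W, POSITION_W = 0.5, 0.4, 0.1


def score(query_emb: Vector, article: Article) -> float:
    """Combine content-visual, visual-visual, and positional similarity."""
    # Content-visual: query image against weighted textual components.
    content = (HEADLINE_W * cosine(query_emb, article.headline_emb)
               + BODY_W * cosine(query_emb, article.body_emb))

    # Visual-visual: best-matching article image, and where it sits.
    best_visual, best_position = 0.0, 0.0
    for emb, pos in zip(article.image_embs, article.image_positions):
        sim = cosine(query_emb, emb)
        if sim > best_visual:
            # Assumption: images nearer the top are more central.
            best_visual, best_position = sim, 1.0 - pos

    return (CONTENT_W * content
            + VISUAL_W * best_visual
            + POSITION_W * best_position)


def retrieve(query_emb: Vector, articles: List[Article],
             top_k: int = 5) -> List[str]:
    """Return the ids of the top_k highest-scoring articles."""
    ranked = sorted(articles, key=lambda a: score(query_emb, a), reverse=True)
    return [a.article_id for a in ranked[:top_k]]
```

In a full pipeline of the kind the abstract outlines, the ranked list would pass through a relevance refinement step before the retrieved articles are handed to the caption generation stages.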
Source: arXiv: 2606.18553