Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

📄 Abstract - Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Traditional image captioning methods often struggle to produce comprehensive, context-rich descriptions, especially when the relevant details cannot be observed directly in the image. To address this, we propose a retrieval-augmented image captioning framework that draws on external knowledge to generate captions with deeper insight, covering object attributes, event context, and underlying significance. Our approach features a hierarchical multi-modal article retrieval mechanism that treats each article as a structured document rather than a monolithic block of text. Retrieval uses structure-aware article features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, together with multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further refines the retrieved information. The retrieved articles then serve as the knowledge base for caption generation, which proceeds in three steps: first, a vision-language model (VLM) generates a concise image description; second, we extract the article segments relevant to that description; and finally, a large language model (LLM) combines the description and the extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and placed 5th, with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at this https URL.
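To make the idea of structure-aware, weighted retrieval more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the query image and all article components have already been embedded into a shared vector space (for example by a CLIP-style encoder), and it fuses three similarities with hypothetical weights: query image vs. headline, query image vs. best-matching body section, and query image vs. best-matching article image. The `Article` class, the `WEIGHTS` values, and the function names are illustrative assumptions; the discourse-positioning signal and the refinement stage mentioned in the abstract are not modeled here.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class Article:
    """Hypothetical container for pre-computed embeddings of one article."""
    article_id: str
    headline_vec: list                              # embedding of the headline
    section_vecs: list                              # one embedding per body section
    image_vecs: list = field(default_factory=list)  # embeddings of the article's images


# Assumed weights, chosen only for illustration (not taken from the paper).
WEIGHTS = {"headline": 0.3, "body": 0.3, "visual": 0.4}


def score_article(query_vec, article, weights=WEIGHTS):
    """Weighted sum of content-visual and visual-visual similarities."""
    headline_sim = cosine(query_vec, article.headline_vec)
    # Use the best-matching section / image rather than an average.
    body_sim = max((cosine(query_vec, s) for s in article.section_vecs), default=0.0)
    visual_sim = max((cosine(query_vec, v) for v in article.image_vecs), default=0.0)
    return (weights["headline"] * headline_sim
            + weights["body"] * body_sim
            + weights["visual"] * visual_sim)


def retrieve(query_vec, articles, top_k=2):
    """Return the ids of the top_k highest-scoring articles."""
    ranked = sorted(articles, key=lambda a: score_article(query_vec, a), reverse=True)
    return [a.article_id for a in ranked[:top_k]]


# Toy 2-D embeddings standing in for real encoder outputs.
articles = [
    Article("flood", [1.0, 0.0], [[0.9, 0.1]], [[1.0, 0.0]]),
    Article("election", [0.0, 1.0], [[0.1, 0.9]], [[0.0, 1.0]]),
]
print(retrieve([1.0, 0.0], articles, top_k=1))  # → ['flood']
```

Taking the maximum over sections and images is one possible design choice: it rewards an article when any single component matches the query strongly, whereas averaging would favor articles that are uniformly on topic.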

1️⃣ One-Sentence Summary

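This paper proposes a new image captioning framework that retrieves hierarchically over article structure (such as headlines, body text, and image placement) and fuses visual and textual information. This helps the model supplement news image captions with deeper background knowledge that is not visible in the image, producing richer, more context-aware captions.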
