arXiv submission date: 2026-06-02
📄 Abstract - Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data

We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objective factual aspect and a subjective affective aspect, and (2) associate images with their relevant perception experiences. We introduce **PercepT** (**Percep**tion topic **T**ransformer), a two-stage architecture that tackles P-Topics modeling. In the formation stage, PercepT discovers *P-Topics* as visual-textual clusters using an unsupervised training objective, and dynamically selects the number of clusters to match the perceptual richness of the dataset. In the mapping stage, it learns *P-Topic mapping functions* via attention pooling to associate images with their respective clusters. On ArtELingo, PercepT achieves a silhouette score of **0.97**, compared to **0.37** for the closest baseline, reflecting better perceptual clusters. PercepT also achieves an AUC score of **0.94**, compared to **0.77**, showing better mapping to perceptual clusters. Human evaluation confirms that PercepT captures semantically meaningful perception experiences and significantly outperforms existing methods. Our implementation will be made public.
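
To make the reported silhouette scores easier to interpret, the sketch below computes the standard mean silhouette coefficient, where each point scores (b - a) / max(a, b), with a the mean distance to its own cluster and b the smallest mean distance to any other cluster. It is not the authors' evaluation code: the Euclidean distance, the function names, and the toy 2-D points are all assumptions made for illustration, and the abstract does not say which distance or feature space was used.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    Illustrative helper only; not taken from the paper.
    """
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
tight = silhouette_score(points, [0, 0, 1, 1])   # labels follow the groups
mixed = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Scores lie in [-1, 1]. Values near 1 indicate compact, well-separated clusters, which is the sense in which 0.97 reflects better clusters than 0.37.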

Top-level tags: multi-modal, machine learning, computer vision
Detailed tags: perception modeling, vision-language, unsupervised clustering, affective computing, cross-cultural

Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data


1️⃣ One-sentence summary

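This paper introduces P-Topics modeling, a new problem, and PercepT, a two-stage transformer architecture that automatically discovers, from images and their captions, the factual and affective ways in which people from different cultures perceive the same image, going beyond purely semantic analysis to capture more accurately how different groups subjectively experience images.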

Source: arXiv: 2606.03345