arXiv submission date: 2026-04-15
📄 Abstract - FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

Humans routinely infer taste, smell, texture, and even sound from food images, a phenomenon well studied in cognitive science. However, prior vision-language research on food has focused primarily on recognition tasks such as meal identification, ingredient detection, and nutrition estimation. Image-based prediction of multisensory experience remains largely unexplored. We introduce FoodSense, a human-annotated dataset for cross-sensory inference containing 66,842 participant-image pairs across 2,987 unique food images. Each pair includes numeric ratings (1-5) and free-text descriptors for four sensory dimensions: taste, smell, texture, and sound. To enable models to both predict and explain sensory expectations, we expand short human annotations into image-grounded reasoning traces: a large language model generates visual justifications conditioned on the image, ratings, and descriptors. Using these annotations, we train FoodSense-VL, a vision-language benchmark model that produces both multisensory ratings and grounded explanations directly from food images. This work connects cognitive science findings on cross-sensory perception with modern instruction tuning for multimodal models and shows that many popular evaluation metrics are insufficient for visually grounded sensory inference.
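To make the dataset structure concrete, here is a minimal sketch of what a single FoodSense annotation record might look like, assuming one record per participant-image pair with a 1-5 rating and a free-text descriptor for each of the four sensory dimensions. All class and field names here are hypothetical; the abstract describes the content of each pair but not a concrete schema.

```python
from dataclasses import dataclass

# Hypothetical schema for one FoodSense participant-image annotation.
# The abstract only specifies the content: a numeric rating (1-5) and a
# free-text descriptor for each of four sensory dimensions.

SENSORY_DIMENSIONS = ("taste", "smell", "texture", "sound")

@dataclass
class SensoryAnnotation:
    rating: int       # numeric rating on a 1-5 scale
    descriptor: str   # free-text descriptor, e.g. "crunchy"

@dataclass
class FoodSenseRecord:
    image_id: str                               # one of 2,987 unique food images
    participant_id: str                         # annotator; 66,842 participant-image pairs total
    annotations: dict[str, SensoryAnnotation]   # keyed by sensory dimension

    def __post_init__(self):
        # Check that every dimension is present and each rating is in range;
        # a missing dimension raises KeyError, an out-of-range rating AssertionError.
        for dim in SENSORY_DIMENSIONS:
            ann = self.annotations[dim]
            assert 1 <= ann.rating <= 5, f"{dim} rating out of 1-5 range"

# Example usage with made-up values:
record = FoodSenseRecord(
    image_id="img_0042",
    participant_id="p_0137",
    annotations={
        "taste": SensoryAnnotation(4, "sweet, slightly tangy"),
        "smell": SensoryAnnotation(3, "fresh citrus"),
        "texture": SensoryAnnotation(5, "crisp"),
        "sound": SensoryAnnotation(4, "loud crunch"),
    },
)
print(record.annotations["texture"].descriptor)  # -> "crisp"
```

Under this reading, the expansion step described in the abstract would take such a record plus the image and prompt a large language model for a visual justification of each rating, yielding the image-grounded reasoning traces used to train FoodSense-VL.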

Top-level tags: multi-modal, computer vision, model evaluation
Detailed tags: food perception, multisensory inference, vision-language model, dataset, cross-modal prediction

FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images


1️⃣ One-Sentence Summary

This paper introduces FoodSense, a dataset and benchmark model that enables AI to predict and explain the multisensory experience a food evokes in people, including taste, smell, texture, and sound, from an image alone, rather than merely recognizing the food itself.

Source: arXiv 2604.14388