arXiv submission date: 2026-02-26
📄 Abstract - SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses

The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and fail to capture their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data collected entirely by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
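The abstract names the agent's stages but does not spell out how they compose. Below is a minimal sketch, assuming the detect-then-decouple-then-search ordering the abstract describes; it is not the authors' released code, and every class, field, and method name (`BenchmarkItem`, `detect_object_of_interest`, `decouple_query`, `search`, `generate_answer`) is a hypothetical placeholder for the corresponding component.

```python
# A minimal sketch, NOT the authors' code: it illustrates the SUPERLENS-style
# pipeline from the abstract (automatic object detection -> query decoupling
# -> multimodal web search -> retrieval-augmented answer generation).
# All names below are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class BenchmarkItem:
    """One SUPERGLASSES-style example (fields assumed from the abstract)."""
    image_path: str                 # egocentric photo from the glasses
    question: str                   # user query over external knowledge
    image_domain: str               # one of the 14 image domains
    query_category: str             # one of the 8 query categories
    search_trajectory: list = field(default_factory=list)  # annotated steps


def answer_query(item: BenchmarkItem, detector, vlm, search_engine) -> str:
    """Run the agent loop over a single image-question pair."""
    # 1. Ground the question in the scene first: the abstract stresses that
    #    identifying the object of interest must precede any retrieval.
    target = detector.detect_object_of_interest(item.image_path, item.question)

    # 2. Query decoupling: rewrite the visual question into a text query
    #    about the detected object that a search engine can handle.
    web_query = vlm.decouple_query(item.question, target)

    # 3. Multimodal web search over external knowledge sources.
    evidence = search_engine.search(web_query)

    # 4. Retrieval-augmented generation conditioned on the image, the
    #    question, and the retrieved evidence.
    return vlm.generate_answer(item.image_path, item.question, evidence)
```

The design choice worth noting is step 1: unlike generic RAG pipelines that query the web directly from the raw question, this ordering makes retrieval conditional on first resolving what the wearer is looking at.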

Top tags: multi-modal, benchmark, agents
Detailed tags: vision language models, smart glasses, visual question answering, egocentric vision, retrieval-augmented generation

SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses


1️⃣ One-Sentence Summary

This paper presents SUPERGLASSES, the first visual question answering benchmark built from real smart glasses data, and introduces SUPERLENS, a new smart glasses agent that integrates object detection and web search to answer questions, surpassing existing models such as GPT-4o and offering a new approach to the specific challenges of smart glasses scenarios.

Source: arXiv 2602.22683