arXiv submission date: 2026-04-13
📄 Abstract - Sign Language Recognition in the Age of LLMs

Recent Vision Language Models (VLMs) have demonstrated strong performance across a wide range of multimodal reasoning tasks. This raises the question of whether such general-purpose models can also address specialized visual recognition problems such as isolated sign language recognition (ISLR) without task-specific training. In this work, we investigate the capability of modern VLMs to perform ISLR in a zero-shot setting. We evaluate several open-source and proprietary VLMs on the WLASL300 benchmark. Our experiments show that, under prompt-only zero-shot inference, current open-source VLMs lag classic supervised ISLR classifiers by a wide margin. However, follow-up experiments reveal that these models capture partial visual-semantic alignment between signs and text descriptions. Larger proprietary models achieve substantially higher accuracy, highlighting the importance of model scale and training data diversity. All our code is publicly available on GitHub.
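The prompt-only zero-shot protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `query_vlm` is a hypothetical stand-in for whatever VLM API is used, and the prompt wording is an assumption.

```python
from typing import Callable, List, Tuple

def zero_shot_islr_accuracy(
    samples: List[Tuple[str, str]],        # (video_path, gold_gloss) pairs
    candidate_glosses: List[str],          # closed vocabulary, e.g. the WLASL300 gloss set
    query_vlm: Callable[[str, str], str],  # hypothetical VLM call: (video, prompt) -> gloss
) -> float:
    """Top-1 accuracy of a prompt-only zero-shot VLM on isolated sign recognition."""
    prompt = (
        "Which sign is performed in this video? "
        "Answer with exactly one gloss from: " + ", ".join(candidate_glosses)
    )
    correct = 0
    for video_path, gold in samples:
        pred = query_vlm(video_path, prompt).strip().lower()
        correct += pred == gold.lower()
    return correct / len(samples) if samples else 0.0

# Stub model standing in for a real VLM API (illustrative assumption only).
def stub_vlm(video_path: str, prompt: str) -> str:
    return "book" if "book" in video_path else "drink"

samples = [("clips/book_001.mp4", "book"), ("clips/hello_002.mp4", "hello")]
acc = zero_shot_islr_accuracy(samples, ["book", "hello", "drink"], stub_vlm)
# acc == 0.5: the stub answers "book" correctly and misses "hello"
```

The key design point is that the VLM sees only a prompt listing the closed gloss vocabulary; no sign-language-specific fine-tuning is involved, which is what makes the evaluation zero-shot.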

Top-level tags: natural language processing, computer vision, multi-modal
Detailed tags: sign language recognition, vision language models, zero-shot learning, visual-semantic alignment, benchmark evaluation

Sign Language Recognition in the Age of LLMs


1️⃣ One-sentence summary

This paper investigates whether current state-of-the-art vision language models can recognize isolated signs without task-specific training. The finding: while large proprietary models perform reasonably well, open-source models in the zero-shot setting still fall far short of traditional supervised classifiers.

Source: arXiv:2604.11225