arXiv submission date: 2025-12-26

📄 Abstract - See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
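The abstract names two shaping terms: a KL-consistency constraint between the original image and an evidence-preserving view, and a KL-separation constraint between the original image and an evidence-ablated view. Below is a minimal PyTorch-style sketch of how such terms could be computed from the model's answer-token distributions; the function name, the KL directions, the margin hinge, and the reduction are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def bips_shaping_losses(logits_orig, logits_keep, logits_ablate, margin=1.0):
    """Hypothetical sketch of BiPS-style shaping terms (not the paper's exact loss).

    logits_orig   : answer-token logits given the original image
    logits_keep   : logits given the evidence-preserving (question-relevant) view
    logits_ablate : logits given the evidence-ablated view (critical pixels masked)
    All tensors are assumed to have shape (batch, vocab).
    """
    logp_orig   = F.log_softmax(logits_orig, dim=-1)
    logp_keep   = F.log_softmax(logits_keep, dim=-1)
    logp_ablate = F.log_softmax(logits_ablate, dim=-1)

    # KL-consistency: the distribution from the evidence-preserving view should
    # stay close to the one from the full image (coarse but complete coverage).
    kl_consistency = F.kl_div(logp_keep, logp_orig, log_target=True,
                              reduction="batchmean")

    # KL-separation: the distribution from the evidence-ablated view should
    # diverge from the original one; a hinge with a margin keeps the term bounded
    # instead of pushing the divergence to infinity.
    kl_separation = F.kl_div(logp_ablate, logp_orig, log_target=True,
                             reduction="batchmean")
    separation_loss = F.relu(margin - kl_separation)

    return kl_consistency, separation_loss
```

In a training loop, these two terms would presumably be weighted and added to the standard answer-supervision loss; how the masked views are produced from the question and how the weights are set is left to the paper itself.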

Top tags: multi-modal, model training, machine learning
Detailed tags: vision-language models, perceptual shaping, multimodal reasoning, training objective, visual evidence

See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning


1️⃣ One-sentence summary

This paper proposes a new method called Bi-directional Perceptual Shaping (BiPS), which trains the model to focus more precisely on the question-relevant regions of an image and to avoid text-only shortcuts when answering, significantly improving the accuracy and generalization of vision-language models on multimodal reasoning tasks.

Source: arXiv: 2512.22120