arXiv submission date: 2026-04-06
📄 Abstract - CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
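The abstract's key architectural point is that the Latent Representation Bridge lets the generated visual state feed reasoning directly, instead of the lossy decode-to-pixels-then-re-encode detour. The toy sketch below illustrates that data flow only; every function name here (`encode`, `generate_latent`, `latent_bridge`, `decode_reencode`, `answer`) is a hypothetical stand-in, not the paper's actual API, and the "lossiness" is simulated with crude quantization.

```python
# Toy illustration of the generate-then-answer flow described in the
# abstract. All components are illustrative stubs over lists of floats.

def encode(image):
    """Stub vision encoder: map a (degraded) image to a latent."""
    return [x * 0.5 for x in image]

def generate_latent(latent):
    """Stub generative pathway: refine the latent toward a cleaner state."""
    return [x + 0.1 for x in latent]

def latent_bridge(gen_latent):
    """Latent Representation Bridge: pass the generated latent straight
    into reasoning, skipping decode -> re-encode (identity in this toy)."""
    return gen_latent

def decode_reencode(gen_latent):
    """Baseline detour: decode to pixels, then re-encode. The coarse
    quantization here simulates information loss along that path."""
    pixels = [round(x * 4) / 4 for x in gen_latent]  # lossy "decode"
    return [x * 1.0 for x in pixels]                 # "re-encode"

def answer(reason_latent):
    """Stub reasoning head: collapse the latent into a scalar 'answer'."""
    return sum(reason_latent)

degraded = [0.2, 0.4, 0.6]      # stand-in for a degraded input image
z = encode(degraded)
z_gen = generate_latent(z)      # generate first ...

ans_bridge = answer(latent_bridge(z_gen))    # ... then answer via the bridge
ans_detour = answer(decode_reencode(z_gen))  # baseline detour for contrast
print(ans_bridge, ans_detour)
```

Because the bridge path is differentiable end-to-end in the real framework, the generation step can be trained jointly with reasoning under answer-correctness rewards (the Interleaved GRPO stage); the detour path cannot support that joint optimization, which is the disconnect the abstract identifies.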

Top-level tags: multi-modal model training, model evaluation
Detailed tags: image degradation, multimodal understanding, generative reasoning, robustness, reinforcement learning

CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models


1️⃣ One-sentence summary

This paper proposes a framework called CLEAR that trains a unified multimodal model to actively generate image detail during reasoning and optimizes the connection between generation and understanding, substantially improving the model's comprehension of degraded images (blur, noise, and the like) while preserving its performance on clean images.

Source: arXiv 2604.04780