Abstract - Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback
Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
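To make the set-of-tuples representation concrete, here is a minimal Python sketch of a defect set and one possible way to turn it into a box-derived, importance-weighted spatial signal. The abstract does not specify the schema or the reward formula used by BoxFlow-GRPO, so everything here is an assumption made for illustration: the `Defect` field names, the pixel box convention, the `[0, 1]` importance range, the max-over-boxes rule in `penalty_map`, the `scalar_reward` definition, and the example defect types and reasons.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Defect:
    """One defect instance as a (location, type, reason, importance) tuple."""
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels; x1/y1 exclusive (assumed convention)
    type: str                       # illustrative label, not the paper's taxonomy
    reason: str                     # free-text explanation bound to this instance
    importance: float               # assumed to lie in [0, 1]


def penalty_map(defects: List[Defect], width: int, height: int) -> List[List[float]]:
    """Rasterize a defect set into a per-pixel penalty map.

    Hypothetical rule: each pixel takes the maximum importance among the
    boxes that cover it; pixels outside every box get 0.
    """
    grid = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        # Clip the box to the image so out-of-range boxes are handled safely.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = max(grid[y][x], d.importance)
    return grid


def scalar_reward(defects: List[Defect], width: int, height: int) -> float:
    """Hypothetical image-level reward: 1 minus the mean per-pixel penalty."""
    grid = penalty_map(defects, width, height)
    total = sum(sum(row) for row in grid)
    return 1.0 - total / (width * height)


# A variable-cardinality defect set for a toy 4x4 image.
defects = [
    Defect(box=(0, 0, 2, 2), type="anatomy", reason="hand has six fingers", importance=0.8),
    Defect(box=(1, 1, 4, 3), type="text", reason="sign lettering is garbled", importance=0.5),
]
grid = penalty_map(defects, width=4, height=4)
reward = scalar_reward(defects, width=4, height=4)
```

Because each defect is a separate record, the number of defects can vary per image and each reason stays attached to its own box, which is the property the abstract contrasts with heatmap regression. An empty defect set yields a penalty map of zeros and a reward of 1.0 under this assumed rule.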
Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback
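1️⃣ One-sentence summary
The paper proposes Structured Defect Grounding (SDG), a new approach that recasts defect diagnosis for text-to-image generation as a structured prediction task: each defect is annotated with four elements (location, type, reason, and importance), and, together with a newly built dataset and evaluation protocol, this significantly improves the precise localization and semantic explanation of image defects, which in turn helps improve image generation quality.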