菜单

🤖 系统
📄 Abstract - AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at this https URL.

顶级标签: computer vision model evaluation benchmark
详细标签: image-text alignment vision-language models synthetic data clip evaluation fine-grained assessment 或 搜索:

AlignBench:利用合成图像-描述对评估细粒度图文对齐的基准 / AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs


1️⃣ 一句话总结

这篇论文提出了一个名为AlignBench的新基准测试,它通过评估由多种模型生成的详细图文对来更精细地衡量图像与文本的对齐程度,并发现当前主流模型在细粒度对齐上存在明显缺陷。


📄 打开原文 PDF