Beyond Pixel Diffs: Benchmarking Image Change Captioning for Web UI Visual Regression Testing

📄 Abstract

Visual regression testing (VRT) is a standard quality assurance step in modern software release pipelines. On every change, it re-renders user interface (UI) screenshots, compares each one against an approved baseline image, and routes any detected difference to a human reviewer, who decides whether it is an intended update or an unintended regression. A widely used approach, especially in open-source and continuous-integration pipelines, is pixel-level comparison. This comparison is semantically blind: it treats rendering noise and genuine defects identically, producing large volumes of false positives that force developers and testers to spend substantial time manually reviewing flagged differences at every release cycle. Industry tools apply machine learning to VRT but lack public evaluation. More critically, no dataset or benchmark exists to support natural language descriptions of UI changes, a capability that tells testers what changed in words instead of leaving them to interpret a binary flag or a highlighted region. To address this gap, we propose a new task, Web UI Image Change Captioning (WUICC), which sits at the intersection of VRT and image difference captioning (IDC), and release WUICC-bench, the first dataset and benchmark for the task. We evaluate eleven representative IDC methods together with two zero-shot general-purpose LLMs. We find that (1) these methods tend to struggle in the Web UI domain because of its layout diversity, dense text, and fine-grained changes, yet (2) the trained methods already suppress non-meaningful visual noise far more selectively than the pixel-level comparison that VRT relies on, providing a solid foundation for future domain-specific research.
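
To make the "semantically blind" point concrete, here is a minimal sketch of a pixel-level check. It is an illustration under stated assumptions, not the comparison used by any particular VRT tool or by the paper: images are plain nested lists of grayscale values, and the names `pixel_diff_ratio`, `is_flagged`, and the 1% threshold are hypothetical.

```python
from typing import List

# Assumption: a screenshot is modelled as rows of grayscale values (0-255).
Image = List[List[int]]


def pixel_diff_ratio(baseline: Image, candidate: Image) -> float:
    """Return the fraction of pixels that differ between two same-sized images."""
    if len(baseline) != len(candidate) or any(
        len(a) != len(b) for a, b in zip(baseline, candidate)
    ):
        raise ValueError("images must have identical dimensions")
    total = sum(len(row) for row in baseline)
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if a != b
    )
    return changed / total if total else 0.0


def is_flagged(baseline: Image, candidate: Image, threshold: float = 0.01) -> bool:
    """Flag the screenshot for human review when too many pixels changed."""
    return pixel_diff_ratio(baseline, candidate) > threshold


# A 10x10 all-white baseline screenshot.
baseline = [[255] * 10 for _ in range(10)]

# Rendering noise: four pixels are very slightly darker (e.g. anti-aliasing).
noise = [row[:] for row in baseline]
for x in range(4):
    noise[0][x] = 250

# Genuine defect: a 2x2 region has turned black (e.g. a broken element).
defect = [row[:] for row in baseline]
for y in range(2):
    for x in range(2):
        defect[y][x] = 0

# Both candidates change the same number of pixels, so the check cannot
# tell them apart: each one is flagged with an identical score.
print(pixel_diff_ratio(baseline, noise), is_flagged(baseline, noise))
print(pixel_diff_ratio(baseline, defect), is_flagged(baseline, defect))
```

In this toy case the harmless noise and the real defect receive the same score and the same flag, so a reviewer would have to inspect both. A caption of the change, as WUICC proposes, would carry the information that the score alone does not.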

1️⃣ One-sentence summary

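This paper targets the large volume of false positives that pixel-comparison methods produce in Web UI visual regression testing: it proposes a new task, Web UI Image Change Captioning, releases the first benchmark for it (including a dataset), and finds by evaluating existing models that they handle the layout diversity and fine-grained changes of web pages poorly, yet already filter out meaningless noise more intelligently than pixel-level methods.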
