AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

📄 Abstract - AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations that can be used for training-free concept grounding. However, existing attention-based methods often produce overlapping activations on visually confusable concepts. We call this failure mode concept leakage: target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
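The pipeline described in the abstract (anchor selection, a gated similarity graph, one-hot seed propagation) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not specify the similarity measure, the gate rule, or the propagation update, so the cosine similarity, the "at least the row mean" gate, the damped label-propagation update, and all names (`anchor_propagate`, `alpha`, `steps`) are assumptions chosen for illustration, and the inputs are toy data rather than real MM-DiT attention maps.

```python
import numpy as np

def anchor_propagate(concept_attn, self_attn, feats, alpha=0.8, steps=20):
    """Toy anchor-based graph propagation; every design detail is an assumption."""
    n = concept_attn.shape[0]

    # 1) Anchor: the image token with the highest concept-to-image attention,
    #    turned into a one-hot seed.
    anchor = int(np.argmax(concept_attn))
    seed = np.zeros(n)
    seed[anchor] = 1.0

    # 2) Output-space similarity (assumed: non-negative cosine similarity
    #    between per-token output features).
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = np.clip(unit @ unit.T, 0.0, None)

    # 3) Row-wise attention gate (assumed form): keep edge i -> j only if
    #    self_attn[i, j] is at least the mean of row i.
    gate = self_attn >= self_attn.mean(axis=1, keepdims=True)

    # 4) Hybrid graph: gated similarity, row-normalized.
    weights = sim * gate
    weights = weights / (weights.sum(axis=1, keepdims=True) + 1e-8)

    # 5) Propagate the seed (assumed: damped label propagation).
    scores = seed.copy()
    for _ in range(steps):
        scores = alpha * (weights @ scores) + (1.0 - alpha) * seed
    return anchor, scores

# Toy scene: tokens 0-2 belong to object A, tokens 3-5 to a look-alike object B.
feats = np.array([[1.0, 0.1]] * 3 + [[0.9, 0.3]] * 3)
self_attn = np.full((6, 6), 0.1 / 3)   # weak cross-object attention
self_attn[:3, :3] = 0.3                # strong attention within object A
self_attn[3:, 3:] = 0.3                # strong attention within object B
# Concept attention leaks onto object B, but still peaks on object A.
concept_attn = np.array([0.2, 0.3, 0.2, 0.15, 0.1, 0.05])

anchor, scores = anchor_propagate(concept_attn, self_attn, feats)
print(anchor, np.round(scores, 3))
```

In this toy setup the two objects have highly similar features, so similarity alone would connect them. The gate removes the cross-object edges, and the propagated scores stay on the object that contains the anchor even though the raw concept attention assigns non-trivial mass to the other object.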


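1️⃣ One-Sentence Summary

This paper proposes AnchorDiff, a method that requires no additional training. It first selects a high-confidence anchor from the attention map, then uses graph propagation to spread that anchor's information precisely over the corresponding object in the image, which effectively addresses erroneous activations on confusable concepts in multi-modal diffusion models.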
