arXiv submission date: 2026-04-15
📄 Abstract - ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model's architecture. To resolve this conflict, we introduce ASTRA (Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state of the art in pose adherence, while maintaining high identity fidelity and text alignment on DreamBench.
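
The asymmetric idea behind EURoPE (pose tokens bound to canvas locations, identity tokens kept free of them) can be illustrated with a toy position-ID assignment. This is only a sketch under stated assumptions, not the paper's implementation: the abstract does not say how EURoPE assigns positions, so the helper names `build_position_ids` and `rope_rotate`, and the choice to give every identity token the single shared position `(0, 0)`, are hypothetical. The rotation itself is the standard rotary position embedding formula applied to one feature pair.

```python
import math


def build_position_ids(height, width, num_identity_tokens):
    """Return (canvas_ids, pose_ids, identity_ids) as lists of (row, col) pairs."""
    # Canvas (image latent) tokens get their natural grid coordinates.
    canvas_ids = [(r, c) for r in range(height) for c in range(width)]
    # Pose tokens reuse the canvas coordinates, so each one is tied to a location.
    pose_ids = list(canvas_ids)
    # ASSUMPTION: identity tokens all share one placeholder position, so their
    # position encoding carries no information about where the subject appears.
    identity_ids = [(0, 0)] * num_identity_tokens
    return canvas_ids, pose_ids, identity_ids


def rope_rotate(pair, position, base=10000.0, dim_index=0, dim=2):
    """Rotate one (x, y) feature pair by the standard RoPE angle for `position`."""
    angle = position * base ** (-2.0 * dim_index / dim)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    x, y = pair
    return (x * cos_a - y * sin_a, x * sin_a + y * cos_a)


if __name__ == "__main__":
    canvas, pose, identity = build_position_ids(2, 3, num_identity_tokens=4)
    print("canvas:  ", canvas)
    print("pose:    ", pose)      # same coordinates as the canvas
    print("identity:", identity)  # one shared position for every token
    print("rotated: ", rope_rotate((1.0, 0.0), position=canvas[4][1]))
```

In this toy setup, a pose token and the canvas token at the same grid cell receive identical rotations, while identity tokens are rotated identically regardless of where the subject ends up in the image.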

Top-level tags: computer vision, model training, multi-modal
Detailed tags: image generation, diffusion transformer, pose guidance, retrieval-augmented generation, disentangled representation

ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding


1️⃣ One-sentence summary

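This paper proposes a new framework called ASTRA. By decoupling subject appearance from pose structure inside the model and using externally retrieved pose information as precise guidance, it resolves the identity confusion and pose distortion that tend to arise when generating images containing multiple people in different poses.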
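
Retrieving a pose prior from an external database can be pictured as a nearest-neighbor lookup. The sketch below is a generic illustration, not the RAG-Pose pipeline itself: the source does not describe how queries or database entries are represented, so the embedding vectors, the record strings, and the function names `cosine_similarity` and `retrieve_pose` are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_pose(query_embedding, pose_database):
    """Return the pose record whose embedding is most similar to the query.

    `pose_database` is a list of (embedding, pose_record) pairs; this layout
    is hypothetical and chosen only to keep the example small.
    """
    if not pose_database:
        raise ValueError("pose database is empty")
    best_entry = max(
        pose_database,
        key=lambda entry: cosine_similarity(query_embedding, entry[0]),
    )
    return best_entry[1]


if __name__ == "__main__":
    # Toy database: two made-up embeddings with placeholder pose records.
    database = [
        ([1.0, 0.0], "pose_a"),
        ([0.0, 1.0], "pose_b"),
    ]
    print(retrieve_pose([0.9, 0.1], database))
```

The retrieved record would then serve as the explicit structural condition passed to the generator, alongside the subject reference images.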

Source: arXiv: 2604.13938