📄 Abstract - Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation

Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference time. Visualizations are available at the website: this https URL.
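The review-revise-regenerate loop the abstract describes can be sketched as toy code. Everything here is an assumption for illustration: the function names (`pris_loop`, `verify_elements`), the representation of prompts and visuals as attribute sets, and the majority-vote notion of a "recurring failure" are stand-ins, not the paper's actual implementation or API.

```python
# Hypothetical sketch of the PRIS-style loop: scale visuals, verify each one
# at the element (attribute) level, find recurring failures, revise the prompt.
# Prompts and visuals are modeled as sets of attributes purely for illustration.

def verify_elements(prompt_attrs, visual):
    """Element-level factual check (stub): return the prompt attributes
    that the generated visual fails to realize."""
    return prompt_attrs - visual

def pris_loop(prompt_attrs, generate, revise, num_rounds=3, num_seeds=4):
    """Generate several visuals per round; if an attribute fails in a
    majority of them, treat it as a recurring failure and revise the prompt."""
    visuals = []
    for _ in range(num_rounds):
        visuals = [generate(prompt_attrs, seed) for seed in range(num_seeds)]
        failures = [verify_elements(prompt_attrs, v) for v in visuals]
        # A perfectly aligned visual ends the search early.
        for v, f in zip(visuals, failures):
            if not f:
                return prompt_attrs, v
        # Recurring failure pattern: missed in more than half the samples.
        recurring = {a for a in prompt_attrs
                     if sum(a in f for f in failures) > num_seeds // 2}
        prompt_attrs = revise(prompt_attrs, recurring)
    return prompt_attrs, visuals[0]
```

A caller would plug in a real text-to-visual model as `generate` and a prompt-rewriting model as `revise`; the sketch only fixes the control flow, in which the prompt (not just the sampling budget) changes between rounds.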

Top-level tags: text-to-video · model evaluation · natural language processing
Detailed tags: prompt engineering · inference-time scaling · visual generation · factual correction · alignment evaluation

Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation


1️⃣ One-sentence summary

This paper proposes a new framework called PRIS, which improves the quality of AI-generated images and videos by dynamically analyzing and revising the text prompt during generation, rather than merely increasing the number of generation attempts as traditional methods do, thereby aligning user intent with the generated output more effectively.


📄 Open original PDF