arXiv submission date: 2026-01-17
📄 Abstract - Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit "understand - plan - code" workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images based on information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain, validating high-fidelity scientific image synthesis as a viable path to unlocking multimodal reasoning capabilities at scale.
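The abstract contrasts pixel-based generation with programmatic synthesis and names an "understand - plan - code" workflow, but gives no implementation details. Below is a minimal, hypothetical Python sketch of that idea under my own assumptions: the "plan" is a structured scene specification derived from the problem statement, and the "code" step renders it deterministically, so geometric facts hold by construction rather than by a pixel model's guess. Every name here (`SceneSpec`, `plan_scene`, `render_scene`) is an illustrative assumption, not the paper's actual API.

```python
# Hypothetical sketch of an "understand -> plan -> code" synthesis pipeline.
# Not ImgCoder's real implementation; an illustration of separating the
# logical plan of a figure from its pixel rendering.
from dataclasses import dataclass
import matplotlib.pyplot as plt


@dataclass
class SceneSpec:
    """Structured plan: the logical content of the figure, fixed before any pixels."""
    title: str
    points: dict[str, tuple[float, float]]   # labeled points, e.g. triangle vertices
    segments: list[tuple[str, str]]          # edges as pairs of point labels


def plan_scene() -> SceneSpec:
    # "Understand + plan": derive exact coordinates from the problem statement
    # (here, a 3-4-5 right triangle), instead of letting a T2I model hallucinate them.
    return SceneSpec(
        title="Right triangle with legs 3 and 4",
        points={"A": (0, 0), "B": (4, 0), "C": (0, 3)},
        segments=[("A", "B"), ("B", "C"), ("C", "A")],
    )


def render_scene(spec: SceneSpec, path: str) -> None:
    # "Code": deterministic rendering of the plan, so lengths, angles, and
    # labels are correct by construction (structural precision).
    fig, ax = plt.subplots()
    for a, b in spec.segments:
        (x0, y0), (x1, y1) = spec.points[a], spec.points[b]
        ax.plot([x0, x1], [y0, y1], color="black")
    for label, (x, y) in spec.points.items():
        ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))
    ax.set_aspect("equal")
    ax.set_title(spec.title)
    fig.savefig(path)


if __name__ == "__main__":
    render_scene(plan_scene(), "triangle.png")
```

The design point this sketch illustrates is the expressiveness-precision trade-off named in the abstract: a declarative plan restricts what can be drawn, but everything it does draw is logically consistent, which is what makes such images usable as verified training data.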

Top-level tags: multi-modal model, evaluation, benchmark
Detailed tags: scientific image synthesis, text-to-image, logical validity, multimodal reasoning, synthetic data

Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility


1️⃣ One-Sentence Summary

This paper systematically studies how to generate scientifically correct images: it proposes a logic-driven framework that improves structural precision and a new benchmark for evaluating scientific correctness, and shows that training large multimodal models on rigorously verified synthetic images yields consistent gains in multimodal reasoning.

Source: arXiv:2601.17027