The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
1️⃣ One-sentence summary
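This paper proposes the "Prism Hypothesis": semantic encoders mainly capture low-frequency, abstract information, while pixel encoders additionally retain high-frequency detail. Building on this observation, it designs a Unified Autoencoding (UAE) model that fuses an image's abstract semantics and fine pixel detail into a single, high-performing representation space.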
Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Our study uncovers a rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We call it the Prism Hypothesis: each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism spreads light into its component frequencies. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via a frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
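To make the low-frequency versus high-frequency distinction concrete, here is a minimal NumPy sketch of a spectral split on a 2D array (an image channel or a feature map). It is an illustration only and not the paper's frequency-band modulator or its analysis code: the function names, the hard radial mask, and the cutoff of 0.1 cycles per sample are all assumptions chosen for the example.

```python
import numpy as np


def radial_frequency_grid(h, w):
    """Radial spatial frequency (cycles per sample) of every 2D FFT bin."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def split_bands(x, cutoff=0.1):
    """Split a 2D array into low- and high-frequency parts that sum back to x.

    `cutoff` is a hypothetical threshold, not a value taken from the paper.
    """
    spectrum = np.fft.fft2(x)
    low_mask = radial_frequency_grid(*x.shape) <= cutoff
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


def low_frequency_energy_ratio(x, cutoff=0.1):
    """Fraction of spectral power that lies at or below the cutoff radius."""
    power = np.abs(np.fft.fft2(x)) ** 2
    low_mask = radial_frequency_grid(*x.shape) <= cutoff
    return power[low_mask].sum() / power.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A smooth pattern stands in for "abstract" content, white noise for fine detail.
    smooth = np.outer(np.sin(2 * np.pi * 2 * np.arange(64) / 64), np.ones(64))
    noise = rng.standard_normal((64, 64))
    print("smooth:", low_frequency_energy_ratio(smooth))
    print("noise: ", low_frequency_energy_ratio(noise))
```

Under the hypothesis as stated in the abstract, one would expect features from a semantic encoder to behave more like the smooth pattern (most energy below the cutoff), and features from a pixel encoder to keep a larger share of energy above it. The sketch only shows how such a ratio could be measured.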
Source: arXiv: 2512.19693