arXiv submission date: 2026-01-28
📄 Abstract - DeepSeek-OCR 2: Visual Causal Flow

We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when they are fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by the inherent logical structure of the content. Particularly for images with complex layouts, human vision exhibits causally informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to reorder visual tokens before the LLM interprets the content. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly accessible at this http URL.
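
To make the contrast between raster-scan order and a reordered token sequence concrete, here is a minimal sketch. It does not reproduce DeepEncoder V2: the abstract describes a learned, semantics-driven reordering, whereas this example applies a fixed, hand-written permutation for a hypothetical two-column page. All function names (`raster_order`, `column_aware_order`, `reorder`) and the `split` parameter are assumptions introduced for illustration only.

```python
def raster_order(rows, cols):
    """Patch coordinates in raster-scan order (top-left to bottom-right)."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def column_aware_order(rows, cols, split):
    """Hypothetical reading order for a two-column page.

    Reads every patch left of `split` first (top to bottom),
    then every patch from `split` onward.
    """
    left = [(r, c) for r in range(rows) for c in range(split)]
    right = [(r, c) for r in range(rows) for c in range(split, cols)]
    return left + right


def reorder(tokens, source_order, target_order):
    """Permute `tokens` (given in `source_order`) into `target_order`."""
    index = {coord: i for i, coord in enumerate(source_order)}
    return [tokens[index[coord]] for coord in target_order]


rows, cols = 2, 4
raster = raster_order(rows, cols)
# Stand-in "visual tokens": one label per patch, named by row and column.
tokens = [f"p{r}{c}" for r, c in raster]
reordered = reorder(tokens, raster, column_aware_order(rows, cols, split=2))

print(tokens)     # raster-scan sequence
print(reordered)  # left column first, then right column
```

In the raster sequence, patches from the two columns are interleaved row by row; after the permutation, each column's patches are contiguous, which is the kind of layout-aware ordering the abstract argues a 1D causal model would benefit from.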

Top-level tags: computer vision, multi-modal, model training
Detailed tags: optical character recognition, visual token reordering, causal reasoning, vision-language models, image understanding

DeepSeek-OCR 2: Visual Causal Flow


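1️⃣ One-sentence summary

This paper proposes a new image encoder modeled on human visual perception: it reorders visual tokens according to the image's content before passing them to a large language model, offering a new approach to understanding images with complex layouts.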

Source: arXiv: 2601.20552