Kling-Omni Technical Report
1️⃣ One-Sentence Summary
This paper presents Kling-Omni, a generalist video generation framework that synthesizes high-quality, highly intelligent video content directly from instructions in multiple forms, such as text, images, or video clips, and unifies video generation, editing, and reasoning tasks, marking an important step toward multimodal systems capable of perceiving and simulating the dynamic, complex world.
We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual-language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike fragmented pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. We believe Kling-Omni moves beyond a content creation tool and marks a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning about, generating, and interacting with dynamic and complex worlds.
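The abstract describes packing text instructions, reference images, and video contexts into a single unified multimodal representation. The sketch below is purely illustrative and is not the Kling-Omni implementation: the module names, embedding width, patch sizes, and packing order are all assumptions, shown only to make the idea of a shared token sequence over heterogeneous inputs concrete.

```python
# A minimal, hypothetical sketch of unifying text, image, and video-context inputs
# into one token sequence for a generative backbone. Not the authors' method.
import torch
import torch.nn as nn

D = 256  # assumed shared embedding width

class UnifiedMultimodalPacker(nn.Module):
    def __init__(self, vocab_size: int = 1000, patch_dim: int = 3 * 16 * 16, d_model: int = D):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> shared space
        self.image_proj = nn.Linear(patch_dim, d_model)       # reference-image patches -> shared space
        self.video_proj = nn.Linear(patch_dim, d_model)       # video-context patches -> shared space
        self.type_embed = nn.Embedding(3, d_model)            # modality tags: 0=text, 1=image, 2=video

    def forward(self, text_ids, image_patches, video_patches):
        # Encode each modality and tag it with a modality-type embedding.
        parts = [
            self.text_embed(text_ids) + self.type_embed.weight[0],
            self.image_proj(image_patches) + self.type_embed.weight[1],
            self.video_proj(video_patches) + self.type_embed.weight[2],
        ]
        # Concatenate along the sequence axis into one unified representation
        # that a single generative model could attend over.
        return torch.cat(parts, dim=1)

if __name__ == "__main__":
    packer = UnifiedMultimodalPacker()
    text_ids = torch.randint(0, 1000, (1, 12))           # a short text instruction
    image_patches = torch.randn(1, 64, 3 * 16 * 16)       # one reference image as 64 patches
    video_patches = torch.randn(1, 4 * 64, 3 * 16 * 16)   # 4 context frames as patches
    tokens = packer(text_ids, image_patches, video_patches)
    print(tokens.shape)  # (1, 332, 256): one sequence covering all modalities
```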
Source: arXiv: 2512.16776