Video representation is a long-standing problem that is crucial for various down-stream tasks, such as tracking,depth prediction,segmentation,view synthesis,and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation-video Gaussian representation -- that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
视频表征是一个长期存在的问题,对于各种下游任务至关重要,如跟踪、深度预测、分割、视图合成和编辑。然而,当前的方法要么因缺乏3D结构而难以模拟复杂动作,要么依赖于不适合操作任务的隐式3D表征。为了应对这些挑战,我们引入了一种新的显式3D表征——视频高斯表征,该表征将视频嵌入到3D高斯中。我们提出的表征在3D规范空间中使用显式高斯作为代理来模拟视频外观,并将每个高斯与视频运动的3D运动相关联。这种方法提供了比分层图集或体积像素矩阵更本质和显式的表征。为了获得这种表征,我们从基础模型中提取2D先验,如光流和深度,以规范这种病态设置中的学习。广泛的应用证明了我们新视频表征的多功能性。它在众多视频处理任务中被证明是有效的,包括跟踪、一致的视频深度和特征精炼、运动和外观编辑,以及立体视频生成。