We introduce 3DEgo to address the novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process: estimating camera poses with Structure-from-Motion (SfM) libraries such as COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images until the 3D scene attains text fidelity. Our framework streamlines this multi-stage pipeline into a single-stage workflow by removing the reliance on COLMAP and eliminating the cost of model initialization. Before 3D scene creation, we edit the video frames with a diffusion model that incorporates our noise blender module to enhance multi-view editing consistency, a step that requires no additional training or fine-tuning of T2I diffusion models. 3DEgo then utilizes 3D Gaussian Splatting to create the 3D scene from the multi-view-consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset.
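To make the single-stage workflow concrete, the following is a minimal illustrative sketch of the idea the abstract describes: editing every frame with a frozen T2I diffusion model while blending each frame's noise prediction with those of its temporal neighbors to encourage multi-view consistency, then fitting the 3D scene directly from the edited frames. The predict_noise stub, the blending weight alpha, the simplified denoising update, and the fit_gaussian_splatting call are hypothetical placeholders for exposition, not the paper's actual implementation.

```python
# Sketch (assumptions labeled): joint frame editing with blended noise predictions,
# followed by direct 3D Gaussian Splatting fitting (no COLMAP, no unedited-model init).
import numpy as np


def predict_noise(frame_latent: np.ndarray, prompt: str, t: int) -> np.ndarray:
    """Stand-in for a frozen T2I diffusion model's noise prediction
    (no additional training or fine-tuning)."""
    rng = np.random.default_rng(t)
    return 0.1 * rng.standard_normal(frame_latent.shape)


def blend_noise(eps_current: np.ndarray, eps_neighbors: list, alpha: float = 0.6) -> np.ndarray:
    """Hypothetical 'noise blender': mix the current frame's noise prediction with
    the mean prediction of temporally adjacent frames to keep edits consistent.
    The weighted-average form and alpha value are assumptions for illustration."""
    if not eps_neighbors:
        return eps_current
    return alpha * eps_current + (1.0 - alpha) * np.mean(eps_neighbors, axis=0)


def edit_video(latents: list, prompt: str, steps: int = 50) -> list:
    """Denoise all frames jointly; each step uses blended noise predictions."""
    latents = [z.copy() for z in latents]
    for t in reversed(range(steps)):
        eps_all = [predict_noise(z, prompt, t) for z in latents]
        for i, z in enumerate(latents):
            neighbors = eps_all[max(0, i - 1):i] + eps_all[i + 1:i + 2]
            eps = blend_noise(eps_all[i], neighbors)
            latents[i] = z - eps / steps  # simplified update rule, not a real DDIM step
    return latents


# The edited, multi-view-consistent frames would then directly supervise
# 3D Gaussian Splatting, exploiting the video's temporal continuity:
# gaussians = fit_gaussian_splatting(edited_frames)  # hypothetical API

frames = [np.zeros((4, 8, 8)) for _ in range(5)]  # toy latent "video"
edited = edit_video(frames, "make it look like autumn")
print(len(edited), edited[0].shape)
```

The key design point illustrated here is that consistency is imposed at the noise-prediction level during editing, so the subsequent 3D reconstruction never sees conflicting multi-view appearances and needs no iterative dataset updates.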