Recent progress in pre-trained diffusion models and 3D generation has spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we use a multi-view diffusion model to initialize multi-view images anchored on the input video frames, where the video can be either captured in the real world or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy that uses the first frame as a temporal anchor in the self-attention computation. With the resulting nearly consistent multi-view sequences, we then apply score distillation sampling (SDS) to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is tailored to the generation task: we propose an adaptive densification strategy that mitigates unstable Gaussian gradients for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state of the art for 4D generation from diverse inputs, including text, image, and video.
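The first-frame anchoring idea can be illustrated with a small sketch. The abstract only states that the first frame serves as a temporal anchor in the self-attention computation; the specific form below (a linear blend of the frame's own attention output with attention over the anchor frame's keys and values) is an assumption for illustration, as are the names `attention`, `anchored_self_attention`, and the `blend` weight. It should not be read as the authors' exact formulation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Standard scaled dot-product attention.
    # q: (n_query, d), k: (n_key, d), v: (n_key, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def anchored_self_attention(q_t, k_t, v_t, k_anchor, v_anchor, blend=0.5):
    # Hypothetical fusion rule (an assumption, not taken from the paper):
    # mix the frame's ordinary self-attention output with attention
    # computed against the anchor (first) frame's keys and values.
    own = attention(q_t, k_t, v_t)
    anchored = attention(q_t, k_anchor, v_anchor)
    return (1.0 - blend) * own + blend * anchored
```

Under this assumed rule, `blend=0` recovers ordinary per-frame self-attention, while larger values pull every frame's features toward those of the first frame, which is one way the anchor could reduce frame-to-frame drift in the generated multi-view sequence.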