3D content creation has achieved significant progress in terms of both quality and speed. Although current feed-forward models can produce 3D objects in seconds, their resolution is constrained by the intensive computation required during training. In this paper, we introduce the Large Multi-View Gaussian Model (LGM), a novel framework for generating high-resolution 3D models from text prompts or single-view images. Our key insights are two-fold: 1) 3D Representation: we propose multi-view Gaussian features as an efficient yet powerful representation, which can then be fused for differentiable rendering. 2) 3D Backbone: we present an asymmetric U-Net as a high-throughput backbone that operates on multi-view images, which can be produced from a text or single-view image input by leveraging multi-view diffusion models. Extensive experiments demonstrate the high fidelity and efficiency of our approach. Notably, our method still generates 3D objects within 5 seconds while raising the training resolution to 512, thereby achieving high-resolution 3D content generation.
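To make the two insights concrete, the following is a minimal PyTorch sketch, not the paper's implementation, of an asymmetric U-Net that maps multi-view images to per-pixel Gaussian parameters and fuses them into a single Gaussian set. The channel widths, the 9-channel input (RGB plus a 6-dimensional ray embedding), and the 14-channel Gaussian layout (3 position + 1 opacity + 3 scale + 4 rotation + 3 color) are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the two ideas named in the abstract: an asymmetric
# U-Net predicting per-pixel 3D Gaussian parameters from multi-view
# images, with the per-view Gaussians fused (concatenated) into one set.
# All sizes below are assumptions for illustration, not LGM's config.
import torch
import torch.nn as nn

class AsymmetricUNet(nn.Module):
    """Downsamples more than it upsamples, so the output feature map
    (one Gaussian per output pixel) is smaller than the input image."""
    def __init__(self, in_ch: int = 9, gaussian_ch: int = 14):
        super().__init__()
        # Encoder: three stride-2 stages (input resolution / 8).
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Decoder: only two upsampling stages, hence "asymmetric": the
        # Gaussian map ends up at half the input resolution.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, gaussian_ch, 3, padding=1),
        )

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, V, C, H, W) multi-view images, e.g. RGB + ray embedding.
        B, V, C, H, W = views.shape
        feats = self.dec(self.enc(views.reshape(B * V, C, H, W)))
        # Each output pixel parameterizes one Gaussian; fuse the views by
        # concatenating all pixels into a single Gaussian set per object,
        # which a differentiable Gaussian rasterizer can then render.
        gs = feats.reshape(B, V, feats.shape[1], -1)       # (B, V, 14, h*w)
        return gs.permute(0, 1, 3, 2).reshape(B, -1, feats.shape[1])

if __name__ == "__main__":
    model = AsymmetricUNet()
    x = torch.randn(2, 4, 9, 256, 256)  # 4 views, RGB + 6-dim ray embedding
    print(model(x).shape)               # torch.Size([2, 65536, 14])
```

The asymmetry is the efficiency lever here: the heavy convolutions run at reduced resolution, while each retained output pixel still contributes a Gaussian to the fused set, keeping memory and compute tractable at higher training resolutions.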