Text-driven 3D indoor scene generation has broad applications, from gaming and smart homes to AR/VR. Fast, high-fidelity scene generation is essential for a user-friendly experience. However, existing methods either take a long time to generate a scene or require intricate manual specification of motion parameters, which is inconvenient for users. Furthermore, these methods often rely on iterative generation from narrow-field viewpoints, which compromises global consistency and overall scene quality. To address these issues, we propose FastScene, a framework for fast, higher-quality 3D scene generation that maintains scene consistency. Specifically, given a text prompt, we generate a panorama and estimate its depth, since a panorama captures the entire scene and exhibits explicit geometric constraints. To obtain high-quality novel views, we introduce the Coarse View Synthesis (CVS) and Progressive Novel View Inpainting (PNVI) strategies, which ensure both scene consistency and view quality. Subsequently, we use Multi-View Projection (MVP) to form perspective views and apply 3D Gaussian Splatting (3DGS) for scene reconstruction. Comprehensive experiments demonstrate that FastScene surpasses other methods in both generation speed and quality, with better scene consistency. Notably, guided only by a text prompt, FastScene can generate a 3D scene within 15 minutes, at least one hour faster than state-of-the-art methods, making it a paradigm for user-friendly scene generation.
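To illustrate why a panorama with estimated depth carries explicit geometric constraints, the following minimal sketch back-projects an equirectangular depth map into a 3D point cloud. It is not FastScene's implementation: the function name `panorama_to_points`, the axis convention (x right, y up, z forward), and the interpretation of depth as distance along the viewing ray are all assumptions made for illustration.

```python
import numpy as np


def panorama_to_points(depth: np.ndarray) -> np.ndarray:
    """Back-project an equirectangular depth map of shape (H, W) to 3D points.

    Assumptions (not taken from the paper): depth is measured along each
    viewing ray, and the camera frame is x right, y up, z forward.
    Returns an array of shape (H, W, 3).
    """
    h, w = depth.shape
    # Normalised pixel-centre coordinates in (0, 1).
    u = (np.arange(w) + 0.5) / w
    v = (np.arange(h) + 0.5) / h
    # Columns map to longitude in (-pi, pi); rows map to latitude in (pi/2, -pi/2).
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both of shape (H, W)
    # Unit ray direction for every pixel on the viewing sphere.
    dirs = np.stack(
        [np.cos(lat) * np.sin(lon), np.sin(lat), np.cos(lat) * np.cos(lon)],
        axis=-1,
    )
    # Scale each unit ray by its depth to obtain the 3D point.
    return dirs * depth[..., None]
```

Because every pixel of the panorama maps to a known ray on the sphere, a single depth map fixes the position of all visible surfaces at once. Under the same assumptions, points obtained this way could be re-rendered from a shifted camera position to produce a coarse novel view, with the resulting holes left for an inpainting step.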