3D semantic occupancy prediction aims to obtain the fine-grained 3D geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to an unbalanced allocation of resources. To address this, we propose GaussianFormer, an object-centric representation that describes 3D scenes with sparse 3D semantic Gaussians, where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of the 3D Gaussians, including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which aggregates only the neighboring Gaussians for a given position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves performance comparable to state-of-the-art methods with only 17.8%–24.8% of their memory consumption.
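To make the Gaussian-to-voxel splatting step concrete, the following is a minimal NumPy sketch of the idea: each Gaussian carries a mean, a covariance, and semantic logits, and each voxel accumulates semantics only from Gaussians within a local neighborhood. The function name, the fixed Euclidean cutoff radius, and the unnormalized-density weighting are illustrative assumptions; the actual method uses an efficient implementation rather than this reference loop.

```python
import numpy as np

def splat_gaussians_to_voxels(means, covs, semantics, voxel_centers, radius=2.0):
    """Reference sketch of Gaussian-to-voxel splatting (assumed interface).

    means:         (G, 3) Gaussian centers
    covs:          (G, 3, 3) Gaussian covariance matrices
    semantics:     (G, C) per-Gaussian semantic logits
    voxel_centers: (V, 3) query positions
    radius:        neighborhood cutoff; only nearby Gaussians contribute
    """
    G, C = semantics.shape
    V = voxel_centers.shape[0]
    out = np.zeros((V, C))
    inv_covs = np.linalg.inv(covs)  # (G, 3, 3) precision matrices
    for g in range(G):
        # Restrict to voxels within the cutoff radius of this Gaussian
        diff = voxel_centers - means[g]               # (V, 3)
        near = np.linalg.norm(diff, axis=1) < radius  # (V,)
        dn = diff[near]
        # Unnormalized Gaussian density at each neighboring voxel center
        maha = np.einsum('vi,ij,vj->v', dn, inv_covs[g], dn)
        w = np.exp(-0.5 * maha)
        # Accumulate this Gaussian's semantics, weighted by its density
        out[near] += w[:, None] * semantics[g]
    return out

# Toy usage: 4 Gaussians, 2 semantic classes, an 8x8x8 grid of voxel centers
rng = np.random.default_rng(0)
means = rng.uniform(-1, 1, (4, 3))
covs = np.stack([np.eye(3) * 0.1] * 4)
semantics = rng.uniform(0, 1, (4, 2))
grid = np.stack(np.meshgrid(*[np.linspace(-1, 1, 8)] * 3), -1).reshape(-1, 3)
occ = splat_gaussians_to_voxels(means, covs, semantics, grid)
print(occ.shape)  # (512, 2) per-voxel semantic scores
```

The locality cutoff is what distinguishes this from a dense decoder: each voxel's cost scales with the number of nearby Gaussians rather than with the full set, which is consistent with the memory savings reported above.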