EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

3D occupancy prediction provides a comprehensive description of the surrounding scene and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents, which must perceive the scene gradually through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update the local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through local refinement of the regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate evaluation of the embodied 3D occupancy prediction task. Experiments demonstrate that our EmbodiedOcc outperforms existing local prediction methods and accomplishes embodied occupancy prediction with high accuracy and strong expandability.
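To make the online update loop concrete, below is a minimal sketch in NumPy. Everything in it is an assumption for illustration: the class count, the resolution of the uniform initialization, `init_global_gaussians`, `refine_local`, the random frustum test, and the fake image features are all invented stand-ins, and the simple logit blend takes the place of the paper's actual deformable cross-attention module.

```python
import numpy as np

NUM_CLASSES = 12   # illustrative semantic class count (assumed)
N_PER_AXIS = 8     # resolution of the uniform Gaussian initialization (assumed)

def init_global_gaussians(n=N_PER_AXIS, num_classes=NUM_CLASSES):
    """Global scene memory: uniformly placed 3D semantic Gaussians, each with
    a mean, an isotropic scale, and uniform ('unknown') semantic logits."""
    axis = np.linspace(0.0, 1.0, n)
    means = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)
    return {
        "means": means,
        "scales": np.full(len(means), 1.0 / n),          # isotropic std-dev
        "logits": np.zeros((len(means), num_classes)),   # uniform = unexplored
    }

def refine_local(gaussians, in_frustum, img_feats, alpha=0.3):
    """Stand-in for the deformable cross-attention refinement: blend
    image-derived semantic evidence into the observed Gaussians only."""
    g = gaussians
    g["logits"][in_frustum] = (1 - alpha) * g["logits"][in_frustum] + alpha * img_feats
    return g

# Online exploration: each frame refines only the region the agent observes,
# while the global Gaussian memory persists across frames.
scene = init_global_gaussians()
rng = np.random.default_rng(0)
for _ in range(5):  # stand-in for an embodied trajectory
    in_frustum = rng.random(len(scene["means"])) < 0.2             # fake frustum test
    img_feats = rng.normal(size=(in_frustum.sum(), NUM_CLASSES))   # fake features
    scene = refine_local(scene, in_frustum, img_feats)
```

The structural point survives the toy components: Gaussians outside the current frustum keep their previous (or uniform, i.e., unexplored) state, so the memory accumulates knowledge exactly where the agent has looked.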

3D occupancy prediction offers a comprehensive description of the surrounding scene and is a core task in 3D perception. Most current methods focus on offline perception from a single view or a few views and cannot meet the needs of embodied agents, which must perceive a scene gradually through exploration. Targeting this practical scenario, this paper formulates the embodied 3D occupancy prediction task and designs the Gaussian-based EmbodiedOcc framework to realize it. We initialize the global scene with uniformly distributed 3D semantic Gaussians and progressively update the local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and integrate them efficiently via deformable cross-attention to refine the regional Gaussian representation. Finally, Gaussian-to-voxel splatting converts the updated 3D Gaussians into a global 3D occupancy representation. EmbodiedOcc assumes an unknown environment (i.e., a uniformly distributed initialization) and explicitly maintains a global memory of it with 3D Gaussians. It acquires knowledge gradually through progressive refinement of local regions, consistent with how humans understand new scenes through embodied exploration. We reorganize the EmbodiedOcc-ScanNet benchmark based on local annotations to evaluate the embodied 3D occupancy prediction task. Experiments show that EmbodiedOcc surpasses existing local prediction methods and achieves embodied occupancy prediction with high accuracy and strong expandability.
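The final Gaussian-to-voxel splatting step can likewise be sketched in isolation. This is a toy version under stated assumptions: isotropic Gaussians, a density model of unnormalized Gaussian weights, and an invented function name (`splat_to_voxels`) with synthetic inputs; a real implementation would handle anisotropic covariances and avoid the per-Gaussian Python loop.

```python
import numpy as np

def splat_to_voxels(means, scales, logits, res=32):
    """Gaussian-to-voxel splatting: accumulate each Gaussian's density-weighted
    semantic logits onto a regular grid, then take the per-voxel argmax."""
    num_classes = logits.shape[1]
    # Voxel centers of a res^3 grid over the unit cube, shape (res^3, 3).
    centers = (np.indices((res, res, res)).reshape(3, -1).T + 0.5) / res
    grid = np.zeros((res ** 3, num_classes))
    for mu, s, logit in zip(means, scales, logits):
        w = np.exp(-np.sum((centers - mu) ** 2, axis=1) / (2.0 * s ** 2))  # density
        grid += w[:, None] * logit[None, :]
    return grid.argmax(-1).reshape(res, res, res)  # semantic label per voxel

# Tiny synthetic check: two Gaussians carrying opposing class evidence.
means = np.array([[0.25, 0.5, 0.5], [0.75, 0.5, 0.5]])
scales = np.array([0.15, 0.15])
logits = np.array([[1.0, 0.0], [0.0, 1.0]])
occ = splat_to_voxels(means, scales, logits, res=16)
print(occ.shape, np.unique(occ))  # (16, 16, 16), labels {0, 1}
```

Splatting in this direction, from Gaussians to voxels, is what lets the framework keep a compact, continuously refinable Gaussian memory while still emitting the dense voxel occupancy that downstream evaluation expects.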