Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation

Mohamed El Amine Boudjoghra¹, Angela Dai², Jean Lahoud¹, Hisham Cholakkal¹, Rao Muhammad Anwer¹,³, Salman Khan¹,⁴, Fahad Khan¹,⁵

¹Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) ²Technical University of Munich (TUM) ³Aalto University ⁴Australian National University ⁵Linköping University

News

30 May 2024: Open-YOLO 3D released on arXiv. 📝
30 May 2024: Code released. 💻

Abstract

Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D CLIP features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only add redundancy that unnecessarily increases the inference time. We empirically find that text prompts can be matched to 3D masks both more accurately and faster with a 2D object detector. We validate Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to 16x speedup compared to the best existing method in the literature. On the ScanNet200 validation set, Open-YOLO 3D achieves a mean average precision (mAP) of 24.7% while operating at 22 seconds per scene.
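The core idea in the abstract, labeling a class-agnostic 3D mask from 2D detections alone, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-view data layout, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the "smallest enclosing box wins" rule are all assumptions made for illustration. Each point of a 3D mask is projected into every view, looks up the detection box it lands in, and casts one vote for that box's label.

```python
from collections import Counter


def project_point(point, fx, fy, cx, cy):
    """Project a camera-frame 3D point to pixel coordinates (pinhole model)."""
    x, y, z = point
    if z <= 0:  # point is behind the camera, so it is not visible
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def box_label_at(pixel, boxes):
    """Return the label of the smallest box containing the pixel, or None.

    Each box is (x_min, y_min, x_max, y_max, label). Preferring the smallest
    box is an illustrative tie-break, not something taken from the paper.
    """
    u, v = pixel
    hits = [b for b in boxes if b[0] <= u <= b[2] and b[1] <= v <= b[3]]
    if not hits:
        return None
    smallest = min(hits, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    return smallest[4]


def vote_mask_label(views, fx, fy, cx, cy):
    """Assign one label to a 3D mask by majority vote over all views.

    views: list of dicts with
      "points": the mask's points expressed in that view's camera frame
      "boxes":  2D detections for that view
    """
    votes = Counter()
    for view in views:
        for point in view["points"]:
            pixel = project_point(point, fx, fy, cx, cy)
            if pixel is None:
                continue
            label = box_label_at(pixel, view["boxes"])
            if label is not None:
                votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy example: two views of the same mask, hypothetical detections.
example_views = [
    {
        "points": [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)],
        "boxes": [(40, 40, 70, 60, "chair"), (0, 0, 100, 100, "table")],
    },
    {
        "points": [(0.0, 0.0, 2.0)],
        "boxes": [(0, 0, 100, 100, "table")],
    },
]
print(vote_mask_label(example_views, fx=100.0, fy=100.0, cx=50.0, cy=50.0))
```

In the toy example, two projected points fall inside the "chair" box and one inside the "table" box, so the mask is labeled "chair". No segmentation model is needed at this step because the 3D mask already defines which points belong to the instance.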

Qualitative results

Installation guide

See the Installation guide for how to set up the Conda environment and download the checkpoints, the pre-computed class-agnostic masks, and the ground truth masks.

Data Preparation

See the Data Preparation guide for how to prepare the ScanNet200 and Replica datasets.

Results reproducibility

To reproduce the exact numbers reported in the paper, use the pre-computed class-agnostic masks we shared.

Reproduce the results of ScanNet200 with pre-computed masks (using Mask3D)

python run_evaluation.py --dataset_name scannet200 --path_to_3d_masks "./output/scannet200/scannet200_masks"

Reproduce the results of ScanNet200 with oracle 3D masks (ground truth 3D masks)

python run_evaluation.py --dataset_name scannet200 --path_to_3d_masks "./output/scannet200/scannet200_ground_truth_masks" --is_gt

Reproduce the results of Replica with pre-computed masks (using Mask3D)

python run_evaluation.py --dataset_name replica --path_to_3d_masks "./output/replica/replica_masks"

Reproduce the results of Replica with oracle 3D masks (ground truth 3D masks)

python run_evaluation.py --dataset_name replica --path_to_3d_masks "./output/replica/replica_ground_truth_masks" --is_gt

You can evaluate without our pre-computed class-agnostic 3D masks, but results may then vary between runs, because steps such as furthest point sampling introduce randomness into the predictions from Mask3D. For results consistent with the ones we report in the paper, we recommend using our pre-computed masks.
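To see why furthest point sampling can make results vary, consider the minimal pure-Python sketch below. It is a generic version of the algorithm, not Mask3D's code, and the function and parameter names are assumptions. The first point is chosen at random, and every later choice depends on it, so two runs can select different subsets unless the random seed is fixed.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def furthest_point_sampling(points, k, seed=None):
    """Return the indices of k points chosen by furthest point sampling.

    Assumes 1 <= k <= len(points). The starting index is random, which is
    the source of run-to-run variation; passing a seed makes it repeatable.
    """
    rng = random.Random(seed)
    start = rng.randrange(len(points))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is furthest from everything selected so far.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = squared_distance(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# A small synthetic point cloud with distinct points.
cloud = [(float(i), float((i * 7) % 5), 0.0) for i in range(20)]
same_a = furthest_point_sampling(cloud, 5, seed=0)
same_b = furthest_point_sampling(cloud, 5, seed=0)
print(same_a == same_b)  # identical seeds give identical samples
```

Without a fixed seed, the sampled subset can change from run to run, and anything computed from it downstream can change too. Using the shared pre-computed masks sidesteps this entirely.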

Reproduce the results of Replica or ScanNet200 without using our pre-computed masks

python run_evaluation.py --dataset_name $DATASET_NAME
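The evaluation commands in this section differ only in the dataset name, the optional mask path, and the optional `--is_gt` flag. The sketch below assembles them programmatically, which may be convenient when scripting several runs. The helper `build_eval_command` is hypothetical and not part of the repository; it uses only the flags and dataset names shown above.

```python
import shlex

# Dataset names taken from the commands in this section.
SUPPORTED_DATASETS = ("scannet200", "replica")


def build_eval_command(dataset_name, masks_path=None, is_gt=False):
    """Build the argument list for run_evaluation.py (hypothetical helper)."""
    if dataset_name not in SUPPORTED_DATASETS:
        raise ValueError(f"unsupported dataset: {dataset_name!r}")
    command = ["python", "run_evaluation.py", "--dataset_name", dataset_name]
    if masks_path is not None:
        command += ["--path_to_3d_masks", masks_path]
    if is_gt:
        command.append("--is_gt")
    return command


# Print the commands for both datasets without pre-computed masks.
for name in SUPPORTED_DATASETS:
    print(shlex.join(build_eval_command(name)))
```

The returned list can be passed directly to `subprocess.run(command, check=True)` from the repository root to launch an evaluation.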

Single scene inference

import os

from utils import OpenYolo3D

# Initialize the model; the text prompts are defined in the config file.
openyolo3d = OpenYolo3D(f"{os.getcwd()}/pretrained/config.yaml")
# Predict the instance masks and labels (takes around 20 seconds in total).
prediction = openyolo3d.predict(f"{os.getcwd()}/data/replica/office0", 6553.5)
# Save a .ply file for visualization; the output scene can be viewed in MeshLab.
openyolo3d.save_output_as_ply(f"{os.getcwd()}/sample/output.ply", True)

Acknowledgments

We would like to thank the authors of Mask3D and YOLO-World, whose work our model builds on.

BibTeX 🙏

@misc{boudjoghra2024openyolo,
      title={Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation}, 
      author={Mohamed El Amine Boudjoghra and Angela Dai and Jean Lahoud and Hisham Cholakkal and Rao Muhammad Anwer and Salman Khan and Fahad Shahbaz Khan},
      year={2024},
      eprint={2406.02548},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}