Skip to content

david-rohrschneider/SEEM-Adapter

Β 
Β 

Repository files navigation

πŸ‘€SEEM: Segment Everything Everywhere All at Once Adapter

This is the official repository for "Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks"

Nermeen Abou Baker*, David Rohrschneider*, Uwe Handmann

Installation

  • Install openmpi (required for mpirun):
    sudo apt install libopenmpi-dev
  • Install the requirements:
    pip install -r assets/requirements/requirements.txt
    pip install -r assets/requirements/requirements_custom.txt
  • [OPTIONAL] Build the deformable attention module:
    cd modeling/vision/encoder/ops && sh make.sh && cd ../../../../
  • Download the datasets (NDD-20, ZeroWaste-f, WIXray, Cityscapes) and configure their file paths in their corresponding dataset registration files of this repo.

Fine-tuning

  • Download the base weights:
    wget https://huggingface.co/xdecoder/SEEM/resolve/main/seem_focall_v1.pt
  • Configure your fine-tuning settings within the provided training shell script example by filling out the <<FIELDS>>
    • SOLVER.IGNORE_FIX are the module names you want to be trainable. Adapter and LoRA weights are set trainable by default if enabled, so the don't need to be defined here.
  • Set up the wandb environment variables of .env.example by copying the file to .env and filling out your credentials for weights an biases
  • Start the training with source train_adapter.sh or source train_lora.sh

Citation

To cite our work, please use the following citation:

@Article{make6040133,
    AUTHOR = {Abou Baker, Nermeen and Rohrschneider, David and Handmann, Uwe},
    TITLE = {Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks},
    JOURNAL = {Machine Learning and Knowledge Extraction},
    VOLUME = {6},
    YEAR = {2024},
    NUMBER = {4},
    PAGES = {2783--2807},
    URL = {https://www.mdpi.com/2504-4990/6/4/133},
    ISSN = {2504-4990},
    DOI = {10.3390/make6040133}
}

Original Readme:

πŸ‡ [Read our arXiv Paper] Β  🍎 [Try our Demo]

We introduce SEEM that can Segment Everything Everywhere with Multi-modal prompts all at once. SEEM allows users to easily segment an image using prompts of different types including visual prompts (points, marks, boxes, scribbles and image segments) and language prompts (text and audio), etc. It can also work with any combination of prompts or generalize to custom prompts!

by Xueyan Zou*, Jianwei Yang*, Hao Zhang*, Feng Li*, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao^, Yong Jae Lee^, in NeurIPS 2023.

A brief introduction of all the generic and interactive segmentation tasks we can do!

SEEM design

πŸš€ Updates

  • [2023.10.10] We release the training log for SEEM-Large-v1!
  • [2023.10.08] We release the training log for SEEM-Tiny-v1!
  • [2023.10.04] We are excited to release βœ… training/evaluation/demo code, βœ… new checkpoints, and βœ… comprehensive readmes for both X-Decoder and SEEM!
  • [2023.09.25] Our work has been accepted to NeurIPS 2023!
  • [2023.07.27] We are excited to release our X-Decoder training code! We will release its descendant SEEM training code very soon!
  • [2023.07.10] We release Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. Code and checkpoint are available!
  • [2023.05.02] We have released the SEEM Focal-L and X-Decoder Focal-L checkpoints and configs!
  • [2023.04.28] We have updated the ArXiv that shows better interactive segmentation results than SAM, which trained on x50 more data than us!
  • [2023.04.26] We have released the Demo Code and SEEM-Tiny Checkpoint! Please try the One-Line Started!
  • [2023.04.20] SEEM Referring Video Segmentation is out! Please try the Video Demo and take a look at the NERF examples.

πŸ“‘ Catalog

We release the following contents for both SEEM and X-Decoder❗

  • Demo Code
  • Model Checkpoint
  • Comprehensive User Guide
  • Training Code
  • Evaluation Code

πŸ‘‰ One-Line SEEM Demo with Linux:

git clone [email protected]:UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git && sh assets/scripts/run_demo.sh

πŸ“ [New] Getting Started:

πŸ“ [New] Latest Checkpoints and Numbers:

COCO Ref-COCOg VOC SBD
Method Checkpoint Backbone PQ ↑ mAP ↑ mIoU ↑ cIoU ↑ mIoU ↑ AP50 ↑ NoC85 ↓ NoC90 ↓ NoC85 ↓ NoC90 ↓
X-Decoder ckpt Focal-T 50.8 39.5 62.4 57.6 63.2 71.6 - - - -
X-Decoder-oq201 ckpt Focal-L 56.5 46.7 67.2 62.8 67.5 76.3 - - - -
SEEM_v0 ckpt Focal-T 50.6 39.4 60.9 58.5 63.5 71.6 3.54 4.59 * *
SEEM_v0 - Davit-d3 56.2 46.8 65.3 63.2 68.3 76.6 2.99 3.89 5.93 9.23
SEEM_v0 ckpt Focal-L 56.2 46.4 65.5 62.8 67.7 76.2 3.04 3.85 * *
SEEM_v1 ckpt SAM-ViT-B 52.0 43.5 60.2 54.1 62.2 69.3 2.53 3.23 * *
SEEM_v1 ckpt SAM-ViT-L 49.0 41.6 58.2 53.8 62.2 69.5 2.40 2.96 * *
SEEM_v1 ckpt/log Focal-T 50.8 39.4 60.7 58.5 63.7 72.0 3.19 4.13 * *
SEEM_v1 ckpt/log Focal-L 56.1 46.3 65.8 62.4 67.8 76.0 2.66 3.44 * *

SEEM_v0: Supporting Single Interactive object training and inference
SEEM_v1: Supporting Multiple Interactive objects training and inference

πŸ”₯ Related projects:

  • FocalNet and DaViT : We used FocalNet and DaViT as the vision backbones.
  • UniCL : We used unified contrastive learning technique for learning image-text representations.
  • X-Decoder : We built SEEM based on X-Decoder which is a generalist decoder that can do multiple tasks with one model only.

πŸ”₯ Other projects you may find interesting:

  • Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity
  • OpenSeed : Strong open-set segmentation methods.
  • Grounding SAM : Combining Grounding DINO and Segment Anything; Grounding DINO: A strong open-set detection model.
  • X-GPT : Conversational Visual Agent supported by X-Decoder.
  • LLaVA : Large Language and Vision Assistant.

πŸ’‘ Highlights

Inspired by the appealing universal interface in LLMs, we are advocating a universal, interactive multi-modal interface for any type of segmentation with ONE SINGLE MODEL. We emphasize 4 important features of SEEM below.

  1. Versatility: work with various types of prompts, for example, clicks, boxes, polygons, scribbles, texts, and referring image;
  2. Compositionaliy: deal with any compositions of prompts;
  3. Interactivity: interact with user in multi-rounds, thanks to the memory prompt of SEEM to store the session history;
  4. Semantic awareness: give a semantic label to any predicted mask;

πŸ¦„ How to use the demo

  • Try our default examples first;
  • Upload an image;
  • Select at least one type of prompt of your choice (If you want to use referred region of another image please check "Example" and upload another image in referring image panel);
  • Remember to provide the actual prompt for each prompt type you select, otherwise you will meet an error (e.g., remember to draw on the referring image);
  • Our model by default support the vocabulary of COCO 80 categories, others will be classified to 'others' or misclassified. If you want to segment using open-vocabulary labels, include the text label in 'text' button after drawing scribbles.
  • Click "Submit" and wait for a few seconds.

πŸŒ‹ An interesting example

An example of Transformers. The referred image is the truck form of Optimus Prime. Our model can always segment Optimus Prime in target images no matter which form it is in. Thanks Hongyang Li for this fun example.

assets/images/transformers_gh.png

🌷 NERF Examples

  • Inspired by the example in SA3D, we tried SEEM on NERF Examples and works well :)

πŸ•οΈ Click, scribble to mask

With a simple click or stoke from the user, we can generate the masks and the corresponding category labels for it.

SEEM design

πŸ”οΈ Text to mask

SEEM can generate the mask with text input from the user, providing multi-modality interaction with human.

example

πŸ•Œ Referring image to mask

With a simple click or stroke on the referring image, the model is able to segment the objects with similar semantics on the target images. example

SEEM understands the spatial relationship very well. Look at the three zebras! The segmented zebras have similar positions with the referred zebras. For example, when the leftmost zebra is referred on the upper row, the leftmost zebra on the bottom row is segmented. example

🌼 Referring image to video mask

No training on video data needed, SEEM works perfectly for you to segment videos with whatever queries you specify! example

🌻 Audio to mask

We use Whisper to turn audio into text prompt to segment the object. Try it in our demo!

assets/images/audio.png

🌳 Examples of different styles

An example of segmenting a meme.

assets/images/emoj.png

An example of segmenting trees in cartoon style.

assets/images/trees_text.png

An example of segmenting a Minecraft image.

assets/images/minecraft.png
An example of using referring image on a popular teddy bear.

example

Model

SEEM design

Comparison with SAM

In the following figure, we compare the levels of interaction and semantics of three segmentation tasks (edge detection, open-set, and interactive segmentation). Open-set Segmentation usually requires a high level of semantics and does not require interaction. Compared with SAM, SEEM covers a wider range of interaction and semantics levels. For example, SAM only supports limited interaction types like points and boxes, while misses high-semantic tasks since it does not output semantic labels itself. The reasons are: First, SEEM has a unified prompt encoder that encodes all visual and language prompts into a joint representation space. In consequence, SEEM can support more general usages. It has potential to extend to custom prompts. Second, SEEM works very well on text to mask (grounding segmentation) and outputs semantic-aware predictions.

assets/images/compare.jpg

πŸ’˜ Acknowledgements

  • We appreciate hugging face for the GPU support on demo!

About

SEEM with adapter finetuning

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 94.9%
  • Cuda 4.4%
  • Other 0.7%