Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However, the existence of complex backgrounds and multiple foreground objects make this task challenging. To address this issue, we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground--background separation. The foreground and background slots, which are initialized with query guidance, are iteratively refined based on interactions with template information. Furthermore, to improve slot--template interaction and effectively fuse global and local features in the target and reference frames, K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally, we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments.
We use fast_pytorch_kmeans for the GPU-accelerated Kmeans algorithm.
pip install fast-pytorch-kmeans
We use the DUTS train dataset for model pretraining and the DAVIS 2016 dataset for fintuning. For DAVIS 2016, RAFT is used to generate optical flow maps. The complete dataset directory structure is as follows:
dataset dir/
├── DUTS_train/
│ ├── RGB/
│ │ ├── sun_ekmqudbbrseiyiht.jpg
│ │ ├── sun_ejwwsnjzahzakyjq.jpg
│ │ └── ...
│ └── GT/
│ ├── sun_ekmqudbbrseiyiht.png
│ ├── sun_ejwwsnjzahzakyjq.png
│ └── ...
├── DAVIS_train/
│ ├── RGB/
│ │ ├── bear_00000.jpg
│ │ ├── bear_00001.jpg
│ │ └── ...
│ ├── GT/
│ │ ├── bear_00000.png
│ │ ├── bear_00001.png
│ │ └── ...
│ └── FLOW/
│ ├── bear_00000.jpg
│ ├── bear_00001.jpg
│ └── ...
└── DAVIS_test/
├── blackswan/
│ ├── RGB/
│ │ ├── blackswan_00000.jpg
│ │ ├── blackswan_00001.jpg
│ │ └── ...
│ ├── GT/
│ │ ├── blackswan_00000.png
│ │ ├── blackswan_00001.png
│ │ └── ...
│ └── FLOW/
│ ├── blackswan_00000.jpg
│ ├── blackswan_00001.jpg
│ └── ...
├── bmx-trees
└── ...
We use a two-stage learning strategy: pretraining and finetuning.
- Edit config.py. The data root path option and GPU index should be modified.
- training
python pretrain.py
- Edit config.py. The best model path generated during the pretraining process is required.
- training
python train.py
TBD
An example of the resulting image is shown below.
- A : RGB image
- B : Optical Flow map
- C : Pred map
- D : GT
- E : Pred forground RGB slot attention map
- F : Pred background RGB slot attention map
- G : Pred forground Flow slot attention map
- H : Pred background Flow slot attention map
@InProceedings{Lee_2024_CVPR,
author = {Lee, Minhyeok and Cho, Suhwan and Lee, Dogyoon and Park, Chaewon and Lee, Jungho and Lee, Sangyoun},
title = {Guided Slot Attention for Unsupervised Video Object Segmentation},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {3807-3816}
}