a) Extract all category names that exist in the datasets listed below; refer to datasets/concept_emb/combined_datasets.txt and datasets/concept_emb/combined_datasets_category_info.py.
b) Extract concept embeddings of the category names with the CLIP text encoder. For your convenience, you can directly download the converted file from Google Drive and put it in the directory datasets/concept_emb/.
c) Alternatively, you can run the following commands on your server to generate it:
$ cd UniVS
$ sh tools/clip_concept_extraction/extract_concept_emb.sh
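For reference, the extraction essentially encodes each category name with the CLIP text encoder. A minimal sketch of the idea, assuming the openai clip package, one category name per line in combined_datasets.txt, and an illustrative prompt and output filename; the script in tools/clip_concept_extraction/ is the reference and may use different prompts and post-processing:
# Minimal sketch: encode category names with the CLIP text encoder.
# Assumptions: openai "clip" package, one name per line in combined_datasets.txt,
# illustrative prompt template and output filename.
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with open("datasets/concept_emb/combined_datasets.txt") as f:
    names = [line.strip() for line in f if line.strip()]

with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {n}" for n in names]).to(device)
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize

torch.save({"names": names, "embeddings": emb.cpu()},
           "datasets/concept_emb/concept_emb.pth")  # hypothetical filename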
Data preparation follows UNINEXT. Thanks a lot :)
A dataset can be used by accessing DatasetCatalog for its data, or MetadataCatalog for its metadata (class names, etc.). This document explains how to set up the builtin datasets so that they can be used by the above APIs. Use Custom Datasets gives a deeper dive on how to use DatasetCatalog and MetadataCatalog, and how to add new datasets to them. All registered datasets can be found in univs/data/datasets/builtin.py.
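For example, once a dataset is registered you can query it like this (a minimal sketch; the dataset name "ytvis_2021_train" is only an assumption, check builtin.py for the exact registered names):
# Minimal sketch of the detectron2 catalog APIs; the dataset name below
# ("ytvis_2021_train") is an assumption -- see univs/data/datasets/builtin.py
# for the names that are actually registered.
from detectron2.data import DatasetCatalog, MetadataCatalog

dataset_dicts = DatasetCatalog.get("ytvis_2021_train")  # list of per-video dicts
metadata = MetadataCatalog.get("ytvis_2021_train")      # class names, colors, etc.
print(len(dataset_dicts), metadata.thing_classes[:5])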
The datasets are assumed to exist in a directory specified by the environment variable DETECTRON2_DATASETS. Under this directory, detectron2 will look for datasets in the structure described below, if needed.
$DETECTRON2_DATASETS/
burst/
coco/
DAVIS/
entityseg/
lvis/
mose/
ovis/
refcoco/
ref-davis/ # only inference
sa_1b/
viposeg/ # only inference
vipseg/
VSPW_480p/ # only inference
ytbvos/
ytvis_2019/
ytvis_2021/
You can set the location for builtin datasets by export DETECTRON2_DATASETS=/path/to/datasets. If left unset, the default is ./datasets relative to your current working directory.
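Internally, the dataset root is typically resolved with the usual detectron2 pattern (a sketch; the exact code lives in the registration files):
# Common detectron2 pattern: fall back to "./datasets" when the environment
# variable DETECTRON2_DATASETS is not set.
import os

_root = os.path.expanduser(os.getenv("DETECTRON2_DATASETS", "datasets"))
print(_root)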
Expected dataset structure for SA-1B:
a) You can use SA-1B-Downloader to download it (11M images in total, but a subset is enough)
$ cd datasets
$ ln -s /path/to/your/sa_1b/dataset sa_1b
sa_1b/
images/
annotations/
annotations_250k/
{annotations_250k_*}.json
b) Split the SA-1B dataset into several annotation sub-files for the dataloader, named annotations_250k_*.json:
$ python datasets/data_utils/split_sa1b_dataset.py
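As a rough illustration of what the split does, assuming one json file per image under annotations/ and an illustrative chunking/naming scheme; split_sa1b_dataset.py is the reference:
# Rough sketch: group SA-1B per-image annotation files into ~250k-image chunks.
# The per-image layout and the output naming are assumptions; the actual
# split_sa1b_dataset.py may differ.
import glob
import json
import os

ANN_DIR = "datasets/sa_1b/annotations"      # assumed: one json per image
OUT_DIR = "datasets/sa_1b/annotations_250k"
CHUNK = 250_000

os.makedirs(OUT_DIR, exist_ok=True)
files = sorted(glob.glob(os.path.join(ANN_DIR, "*.json")))
for i in range(0, len(files), CHUNK):
    merged = []
    for path in files[i:i + CHUNK]:
        with open(path) as f:
            merged.append(json.load(f))
    out_path = os.path.join(OUT_DIR, f"annotations_250k_{i // CHUNK:02d}.json")
    with open(out_path, "w") as f:
        json.dump(merged, f)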
Expected dataset structure for COCO and LVIS:
a) Download the images and annotations for COCO.
b) LVIS uses the COCO 2017 images, so you only need to download the LVIS annotations.
coco/
train2017/
val2017/
annotations/
instances_train2017.json
instances_val2017.json
panoptic_train2017.json
panoptic_train2017_cocovid.json (converted)
panoptic_train2017/
# only use images whose short edge is larger than 512 pixels (see the filtering sketch below)
$ python datasets/data_utils/convert_lvis_to_cocovid.py
lvis/
lvis_v1_train.json
lvis_v1_train_video512p.json (converted)
lvis_v1_val.json
c) Convert COCO panoptic annotations into cocovid format
$ python datasets/data_utils/convert_coco_pan_seg_to_cocovid_train.py
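The 512p filter above amounts to dropping LVIS entries whose short image edge is too small. A minimal sketch of that idea, using the standard LVIS/COCO json fields; the actual convert_lvis_to_cocovid.py additionally converts the result into the cocovid format:
# Minimal sketch: keep only LVIS images whose short edge exceeds 512 pixels.
# Field names follow the standard LVIS/COCO json layout; the real script also
# converts the result into the cocovid (video-style) format.
import json

with open("datasets/lvis/lvis_v1_train.json") as f:
    lvis = json.load(f)

keep = {img["id"] for img in lvis["images"] if min(img["height"], img["width"]) > 512}
lvis["images"] = [img for img in lvis["images"] if img["id"] in keep]
lvis["annotations"] = [ann for ann in lvis["annotations"] if ann["image_id"] in keep]

with open("datasets/lvis/lvis_v1_train_video512p.json", "w") as f:
    json.dump(lvis, f)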
Expected dataset structure for RefCOCO
a) Download the json files processed by SeqTR from Google Drive. We need three folders: refcoco-unc, refcocog-umd, and refcocoplus-unc. These folders should be organized as below.
refcoco/
refcoco/
instances_refcoco_{train,val,testA,testB}.json
instances.json
refcoco+/
instances_refcoco+_{train,val,testA,testB}.json
instances.json
refcocog/
instances_refcocoG_{train,val,test}.json
instances.json
b) Convert annotations to cocovid format
$ python datasets/data_utils/convert_refcoco_to_cocovif_1.py
$ python datasets/data_utils/convert_refcoco_to_cocovif_2.py
$ python datasets/data_utils/convert_refcoco_to_cocovif_3.py
Expected dataset structure for EntitySeg-v1.0
a) Download images from Google Drive or Hugging Face.
b) Download annotations from GitHub.
c) Unzip the images and annotations; we use the panoptic segmentation annotations here.
d) Convert to cocovid format
$ python datasets/data_utils/convert_entityseg_inst_seg_to_cocovid_train.py
$ python datasets/data_utils/convert_entityseg_pan_seg_to_cocovid_train.py
e) The data format:
entityseg/
annotations/
entityseg_insseg_train_cocovid.json
entityseg_panseg_train_cocovid.json
entityseg_train_{01, 02, 03}.json
images/
Expected dataset structure for YouTubeVIS 2021 or Occluded VIS:
a) You only need to download the images and annotations and put them in the following paths (the json files follow the YouTube-VIS format; see the sketch after this section):
ytvis_2021/
{train,valid,test}.json
{train,valid,test}/
JPEGImages/
*.jpg
ovis/
{train,valid,test}.json
{train,valid,test}/
JPEGImages/
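The {train,valid,test}.json files follow the YouTube-VIS json layout, which this repo refers to as the cocovid format. A trimmed sketch of that layout (field names follow the public YTVIS annotations; the values are purely illustrative):
# Trimmed sketch of the YouTube-VIS-style json layout ("cocovid" format).
# Field names follow the public YTVIS annotations; the values are illustrative.
ytvis_like = {
    "videos": [
        {"id": 1, "width": 1280, "height": 720, "length": 2,
         "file_names": ["video_a/00000.jpg", "video_a/00005.jpg"]},
    ],
    "annotations": [
        {"id": 1, "video_id": 1, "category_id": 1, "iscrowd": 0,
         # one entry per frame; None where the object is not visible
         "segmentations": [None, {"size": [720, 1280], "counts": "..."}],
         "bboxes": [None, [100.0, 200.0, 50.0, 80.0]],
         "areas": [None, 4000.0]},
    ],
    "categories": [{"id": 1, "name": "person"}],
}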
Expected dataset structure for YouTubeVOS and Ref-YouTubeVOS:
a) Download the images and annotations of the original YTVOS-2018 dataset.
b) Download the RefVOS annotations to get the meta expressions (a reading sketch follows the directory layout below).
c) Convert the annotations to cocovid format; the data should look like this:
$ python datasets/data_utils/convert_ytvos_to_cocovid_train.py
$ python datasets/data_utils/convert_ytvos_to_cocovid_val.py
$ python datasets/data_utils/convert_refytvos_to_cocovid_train.py
$ python datasets/data_utils/convert_refytvos_to_cocovid_val.py
ytbvos/
train/
JPEGImages/
Annotations/
val/
JPEGImages/
Annotations/
meta_expressions/ (for refytbvos)
train/
meta_expressions.json
val/
meta_expressions.json
test/
meta_expressions.json
meta.json
train.json (after conversion, for ytbvos)
valid.json (after conversion, for ytbvos)
train_ref.json (after conversion)
valid_ref.json (after conversion)
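For orientation, the meta-expression files pair each video with its referring expressions. A minimal reading sketch, assuming the standard Ref-YouTube-VOS layout with a top-level "videos" dict; the converted train_ref.json / valid_ref.json are what the dataloader actually consumes:
# Minimal sketch: walk the Ref-YouTube-VOS meta expressions.
# Assumes the standard layout with a top-level "videos" dict; the converted
# train_ref.json / valid_ref.json are what UniVS actually loads.
import json

with open("datasets/ytbvos/meta_expressions/train/meta_expressions.json") as f:
    meta = json.load(f)

for video_id, video in list(meta["videos"].items())[:3]:
    for exp_id, exp in video["expressions"].items():
        print(video_id, exp_id, exp["obj_id"], exp["exp"])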
Expected dataset structure for VIPSeg, VSPW and VIPOSeg:
Note that VIPSeg is used for training, while VSPW and VIPOSeg are used for inference only.
a) Download VIPSeg from the official GitHub and resize the images to 720p (a resize sketch follows at the end of this section):
$ cd datasets/vipseg
$ python change2_720p.py
b) Convert it to cocovid format with original resolution
# original resolution for training
$ python datasets/data_utils/convert_vipseg_to_cocovid.py
c) Convert it to cocovid format with 720p
# 720p resolution for standard inference
$ python datasets/data_utils/convert_vipseg720p_to_cocovid.py
d) Data format of the VIPSeg dataset:
vipseg/
imgs/
panomasksRGB/
panoptic_gt_VIPSeg_train_cocovid.json # original resolution
panoptic_gt_VIPSeg_val_cocovid.json # original resolution
panoptic_gt_VIPSeg_test_cocovid.json # original resolution
VIPSeg_720P/
imgs/
panomasks/
panomasksRGB/
panoptic_gt_VIPSeg_train_cocovid.json # 720p resolution
panoptic_gt_VIPSeg_val_cocovid.json # 720p resolution
panoptic_gt_VIPSeg_test_cocovid.json # 720p resolution
e) Download the VSPW and VIPOSeg datasets from their official websites and convert them for evaluation:
$ python datasets/data_utils/convert_vspw_to_cocovid_val.py
$ python datasets/data_utils/convert_vspw_to_cocovid_dev.py
$ python datasets/data_utils/convert_viposeg_to_cocovid_val.py
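As a rough illustration of the 720p step in a), assuming the shorter side is resized to 720 pixels (check the official change2_720p.py for the exact rule); note that the panoptic masks must be resized with nearest-neighbor interpolation:
# Rough sketch of the 720p resize in step a). Assumes the shorter side is
# scaled to 720 pixels; the official change2_720p.py is the reference.
# Images use bilinear interpolation, panoptic masks must use nearest-neighbor.
import os
from PIL import Image

def resize_short_side(path_in, path_out, short=720, is_mask=False):
    img = Image.open(path_in)
    w, h = img.size
    scale = short / min(w, h)
    new_size = (round(w * scale), round(h * scale))
    resample = Image.NEAREST if is_mask else Image.BILINEAR
    os.makedirs(os.path.dirname(path_out) or ".", exist_ok=True)
    img.resize(new_size, resample).save(path_out)

# usage (paths follow the layout in d)):
# resize_short_side("datasets/vipseg/imgs/clip/frame.jpg",
#                   "datasets/vipseg/VIPSeg_720P/imgs/clip/frame.jpg")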
Expected dataset structure for TAO/BURST:
TAO is a federated dataset for Tracking Any Object, containing 2,907 high-resolution videos captured in diverse environments, each about half a minute long on average. BURST recently added instance segmentation annotations on top of TAO.
# a) Link the TAO images into the BURST dataset directory
$ cd datasets/
$ ln -s /path/to/datasets/TAO /path/to/datasets/BURST
# b) Download the segmentation annotations: https://omnomnom.vision.rwth-aachen.de/data/BURST/annotations.zip
# c) Convert datasets
$ python datasets/data_utils/convert_burst_to_cocovid_*.py
# d) the data format
burst/
frames/
train/
val/
annotations/
train/
val/
test/
info/
train_uni.json (converted)
val_first_frame_uni.json (converted)
Expected dataset structure for MOSE
# a) Download data (train.tar.gz)
$ gdown 'https://drive.google.com/uc?id=10HYO-CJTaITalhzl_Zbz_Qpesh8F3gZR'
# b) Convert annotations to cocovid format
$ python datasets/data_utils/convert_mose_to_cocovid_train.py
$ python datasets/data_utils/convert_mose_to_cocovid_val.py
Expected dataset structure for Ref-DAVIS17
a) Download the DAVIS2017 dataset from the website. Note that you only need to download the two zip files DAVIS-2017-Unsupervised-trainval-480p.zip and DAVIS-2017_semantics-480p.zip.
b) Download the text annotations from the website and put the zip files in the directory as follows
ref-davis
├── DAVIS-2017_semantics-480p.zip
├── DAVIS-2017-Unsupervised-trainval-480p.zip
├── davis_text_annotations.zip
c) Unzip these zip files
$ cd datasets/ref-davis
$ unzip -o davis_text_annotations.zip
$ unzip -o DAVIS-2017_semantics-480p.zip
$ unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip
d) Preprocess the dataset to Ref-Youtube-VOS format
$ cd ../../ # back to the main directory
$ python datasets/data_utils/convert_davis_to_ytvos.py
e) Finally, unzip the file DAVIS-2017-Unsupervised-trainval-480p.zip again (since the preprocessing uses mv for efficiency):
$ unzip -o DAVIS-2017-Unsupervised-trainval-480p.zip