- TokenFusion Description
- Model Architecture
- Dataset
- Environment Requirements
- Script Description
- Model Description
- Description of Random Situation
- ModelZoo Homepage
TokenFusion is a multimodal token fusion method tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images.
Paper: Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang. Multimodal Token Fusion for Vision Transformers. In CVPR 2022.
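To make the substitution step concrete, below is a minimal NumPy sketch of the per-modality pruning-and-substitution operation, under simplifying assumptions (a fixed score threshold, a single linear projection, no residual positional alignment); it is an illustration of the idea, not the repository's implementation:

```python
import numpy as np

def token_fusion_step(tokens_a, tokens_b, scores_a, proj_b_to_a, threshold=0.02):
    """Replace modality A's uninformative tokens with projected modality-B tokens.

    tokens_a, tokens_b : (num_tokens, dim) aligned token embeddings of two modalities
    scores_a           : (num_tokens,) informativeness scores for A (from a learned predictor)
    proj_b_to_a        : (dim, dim) projection mapping B's features into A's space
    """
    uninformative = scores_a < threshold           # dynamically detected uninformative tokens
    projected_b = tokens_b @ proj_b_to_a           # projected inter-modal features
    return np.where(uninformative[:, None], projected_b, tokens_a)

# toy usage: 8 tokens with 4-dimensional embeddings
rng = np.random.default_rng(0)
tokens_rgb = rng.normal(size=(8, 4))
tokens_depth = rng.normal(size=(8, 4))
scores = rng.uniform(0.0, 0.05, size=8)            # stand-in for the learned token scores
fused_rgb = token_fusion_step(tokens_rgb, tokens_depth, scores, np.eye(4))
print(fused_rgb.shape)  # (8, 4)
```

In the paper the scores come from a small prediction module trained jointly with the transformer, and the substitution is applied symmetrically to every modality.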
The overall architecture of TokenFusion is shown below:
Dataset used: NYUDv2
- Dataset size: color images and depth images, with labels for 40 segmentation classes
- Train: 795 samples
- Test: 654 samples
- Data format: image files
- Note: data will be processed in utils/datasets.py
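As a rough illustration of what the NYUDv2 loader has to provide, the sketch below pairs an RGB image, a depth image, and a 40-class label map for one sample; the directory layout and file names are assumptions, not the conventions actually used by utils/datasets.py:

```python
import os

import numpy as np
from PIL import Image

def load_nyudv2_sample(root, name):
    """Load one RGB/depth/label triple (sub-directory names are hypothetical)."""
    rgb = np.array(Image.open(os.path.join(root, "rgb", name + ".png")))      # H x W x 3, uint8
    depth = np.array(Image.open(os.path.join(root, "depth", name + ".png")))  # H x W, depth map
    label = np.array(Image.open(os.path.join(root, "label", name + ".png")))  # H x W, values 0..39
    return rgb, depth, label
```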
- Hardware (Ascend/GPU)
- Framework: MindSpore
- For more information, please check the resources below:
```text
TokenFusion
├── README.md               # description of TokenFusion
├── models
│   ├── mix_transformer.py  # definition of the backbone model
│   ├── segformer.py        # definition of the segmentation model
│   └── modules.py          # TokenFusion operations
├── utils
│   ├── datasets.py         # data loader
│   ├── helpers.py          # utility functions
│   ├── transforms.py       # data preprocessing functions
│   └── meter.py            # metric tracking utilities
├── eval.py                 # evaluation interface
├── cfg.py                  # configuration file
└── config.py               # configuration file
```
To Be Done
```bash
# infer example
python eval.py --checkpoint_path [CHECKPOINT_PATH]
```
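For orientation, an evaluation entry point of this shape typically parses the flag and loads the trained parameters before building the network and running the NYUDv2 loop; the sketch below only covers the checkpoint-loading part and is not the actual eval.py:

```python
import argparse

import mindspore as ms

def main():
    parser = argparse.ArgumentParser(description="TokenFusion evaluation (sketch)")
    parser.add_argument("--checkpoint_path", required=True, help="path to a trained .ckpt file")
    args = parser.parse_args()

    # Read the saved parameters; constructing the segmentation network and
    # calling ms.load_param_into_net(...) would follow here.
    param_dict = ms.load_checkpoint(args.checkpoint_path)
    print(f"loaded {len(param_dict)} parameter tensors from {args.checkpoint_path}")

if __name__ == "__main__":
    main()
```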
The checkpoint can be downloaded here or from MindSpore Hub.
Result: IoU = 54.8, ckpt = ./tokenfusion_ascend_v180_nyudv2_research_cv_acc54.8.ckpt
| Parameters | Ascend |
| --- | --- |
| Model | TokenFusion |
| Model Version | tokenfusion_seg_mitb3_nyudv2 |
| Resource | Ascend 910 |
| Uploaded Date | 2022-08-10 |
| MindSpore Version | 1.8.0 |
| Dataset | NYUDv2 |
| Outputs | per-pixel class probabilities |
| Accuracy | 1pc: 54.8% (IoU) |
| Speed | 1pc: 1 s/step |
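The accuracy above is intersection-over-union over the 40 NYUDv2 classes. For reference, a generic way to compute mean IoU from a confusion matrix looks like the sketch below (an illustration, not the metric code in utils/meter.py):

```python
import numpy as np

def mean_iou(conf_matrix):
    """Mean IoU from a (num_classes, num_classes) confusion matrix
    with ground truth along rows and predictions along columns."""
    tp = np.diag(conf_matrix).astype(float)
    fp = conf_matrix.sum(axis=0) - tp
    fn = conf_matrix.sum(axis=1) - tp
    union = tp + fp + fn
    iou = np.where(union > 0, tp / np.maximum(union, 1), np.nan)  # skip classes absent from both
    return np.nanmean(iou)

# toy 3-class example
cm = np.array([[50, 2, 3],
               [4, 40, 1],
               [2, 3, 45]])
print(round(mean_iou(cm), 3))
```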
We set the seed inside datasets.py.
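For reference, fixing the seed across the libraries involved usually looks like the snippet below; the exact calls and seed value used in datasets.py may differ:

```python
import random

import numpy as np
import mindspore as ms

SEED = 0
random.seed(SEED)        # Python's built-in RNG
np.random.seed(SEED)     # NumPy RNG used by the data pipeline
ms.set_seed(SEED)        # MindSpore's global seed
```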
Please check the official homepage.