Skip to content

Latest commit

 

History

History
62 lines (42 loc) · 6.38 KB

README.md

File metadata and controls

62 lines (42 loc) · 6.38 KB

MaskFormer

MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation

Introduction

Official Repo

Code Snippet

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Usage

pip install "mmdet>=3.0.0rc4"

Results and models

ADE20K

Method Backbone Crop Size Lr schd Mem (GB) Inf time (fps) Device mIoU mIoU(ms+flip) config download
MaskFormer R-50-D32 512x512 160000 3.29 A100 42.20 44.29 - config model | log
MaskFormer R-101-D32 512x512 160000 4.12 A100 34.90 45.11 - config model | log
MaskFormer Swin-T 512x512 160000 3.73 A100 40.53 46.69 - config model | log
MaskFormer Swin-S 512x512 160000 5.33 A100 26.98 49.36 - config model | log

Note:

  • All experiments of MaskFormer are implemented with 8 V100 (32G) GPUs with 2 samplers per GPU.
  • The results of MaskFormer are relatively not stable. The accuracy (mIoU) of model with R-101-D32 is from 44.7 to 46.0, and with Swin-S is from 49.0 to 49.8.
  • The ResNet backbones utilized in MaskFormer models are standard ResNet rather than ResNetV1c.
  • Test time augmentation is not supported in MMSegmentation 1.x version yet, we would add "ms+flip" results as soon as possible.

Citation

@article{cheng2021per,
  title={Per-pixel classification is not all you need for semantic segmentation},
  author={Cheng, Bowen and Schwing, Alex and Kirillov, Alexander},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  pages={17864--17875},
  year={2021}
}