Instance segmentation aims to assign each pixel to a distinct object instance in the scene. Mask R-CNN is one such approach, combining object detection with semantic segmentation, and a ViT can serve as its backbone. In this project, a pre-trained ViT-based Mask R-CNN model is fine-tuned and evaluated on the Penn-Fudan Database for Pedestrian Detection and Segmentation. The dataset is split into train, validation, and test sets with an 80:10:10 ratio, as sketched below.
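As a minimal sketch of the 80:10:10 split, `torch.utils.data.random_split` can be used; here `PennFudanDataset` and `get_transform` are assumed to be defined as in the TorchVision Object Detection Finetuning Tutorial referenced below, and the seed is an arbitrary choice for reproducibility.

```python
import torch
from torch.utils.data import random_split

# `PennFudanDataset` and `get_transform` are assumed to follow the
# TorchVision Object Detection Finetuning Tutorial (see references).
dataset = PennFudanDataset("PennFudanPed", transforms=get_transform(train=True))

n = len(dataset)              # Penn-Fudan contains 170 images
n_train = int(0.8 * n)        # 80% train
n_val = int(0.1 * n)          # 10% validation
n_test = n - n_train - n_val  # remainder (~10%) test

# Fixed seed so the split is reproducible across runs
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),
)
```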
A Jupyter Notebook covering the entire experiment is available at this link.
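The notebook holds the full implementation; the snippet below is only a simplified sketch of how a plain ViT encoder could be wired into TorchVision's `MaskRCNN` as a single-feature-map backbone. The `ViTBackbone` wrapper is hypothetical, and it drops the class token and positional embeddings for brevity; a faithful ViTDet-style setup (see the benchmarking paper below) would interpolate the positional embeddings instead.

```python
import torch
from torch import nn
import torchvision
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

class ViTBackbone(nn.Module):
    """Hypothetical wrapper exposing a ViT-B/16 encoder as a single 2D feature map."""
    def __init__(self):
        super().__init__()
        vit = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
        self.conv_proj = vit.conv_proj  # 16x16 patch embedding
        self.encoder = vit.encoder
        self.out_channels = 768         # attribute required by MaskRCNN

    def forward(self, x):
        n = x.shape[0]
        x = self.conv_proj(x)              # (N, 768, H/16, W/16)
        gh, gw = x.shape[-2:]
        x = x.flatten(2).transpose(1, 2)   # (N, L, 768) token sequence
        # Simplification: class token and positional embeddings are skipped;
        # a proper port would interpolate the positional embeddings.
        x = self.encoder.ln(self.encoder.layers(self.encoder.dropout(x)))
        return x.transpose(1, 2).reshape(n, 768, gh, gw)

anchor_generator = AnchorGenerator(
    sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),)
)
model = MaskRCNN(
    ViTBackbone(),
    num_classes=2,  # background + pedestrian
    rpn_anchor_generator=anchor_generator,
    box_roi_pool=MultiScaleRoIAlign(["0"], output_size=7, sampling_ratio=2),
    mask_roi_pool=MultiScaleRoIAlign(["0"], output_size=14, sampling_ratio=2),
)
```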
The following table reports the quantitative performance of the ViT-based Mask R-CNN model.
| Test Metric | Score |
|---|---|
| mAP<sub>box</sub>@0.5:0.95 | 96.85% |
| mAP<sub>mask</sub>@0.5:0.95 | 79.58% |
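Scores of this kind can be computed with TorchMetrics' `MeanAveragePrecision`; the sketch below assumes a trained `model` and a `test_loader` yielding images and target dictionaries in the TorchVision detection format, which are not defined here.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One metric instance per IoU type: "bbox" for boxes, "segm" for masks
map_box = MeanAveragePrecision(iou_type="bbox")
map_mask = MeanAveragePrecision(iou_type="segm")

model.eval()
with torch.no_grad():
    for images, targets in test_loader:  # assumed test DataLoader
        preds = model(images)
        # "segm" expects boolean masks; Mask R-CNN outputs soft (N, 1, H, W) masks
        for p in preds:
            p["masks"] = p["masks"].squeeze(1) > 0.5
        for t in targets:
            t["masks"] = t["masks"].bool()
        map_box.update(preds, targets)
        map_mask.update(preds, targets)

print(map_box.compute()["map"])   # mAP_box @ IoU=0.5:0.95
print(map_mask.compute()["map"])  # mAP_mask @ IoU=0.5:0.95
```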
*Loss curves of ViT-based Mask R-CNN on the Penn-Fudan Database for Pedestrian Detection and Segmentation train and validation sets.*
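Since PyTorch Lightning (referenced below) is used, curves like these typically come from `self.log` calls. A minimal sketch, assuming a hypothetical `LitMaskRCNN` wrapper around a TorchVision Mask R-CNN, which returns its loss dictionary only in training mode:

```python
import torch
import pytorch_lightning as pl

class LitMaskRCNN(pl.LightningModule):
    """Hypothetical Lightning wrapper around a TorchVision Mask R-CNN."""
    def __init__(self, model, lr=1e-4):
        super().__init__()
        self.model = model
        self.lr = lr

    def training_step(self, batch, batch_idx):
        images, targets = batch
        loss_dict = self.model(images, targets)  # dict of losses in train mode
        loss = sum(loss_dict.values())
        self.log("train_loss", loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        images, targets = batch
        # TorchVision detection models return losses only in train mode,
        # so switch modes temporarily to obtain a validation loss
        self.model.train()
        with torch.no_grad():
            loss_dict = self.model(images, targets)
        self.model.eval()
        self.log("val_loss", sum(loss_dict.values()), prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)
```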
Qualitative results are presented below.
*A few samples of qualitative results from the ViT-based Mask R-CNN model.*
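Overlays like these can be produced with `torchvision.utils.draw_segmentation_masks`; in the sketch below, the image path and the 0.7/0.5 thresholds are illustrative choices, and `model` is the fine-tuned network from above.

```python
import torch
from torchvision.io import read_image
from torchvision.utils import draw_segmentation_masks
import torchvision.transforms.functional as F

model.eval()
img = read_image("PennFudanPed/PNGImages/FudanPed00001.png")  # uint8 (C, H, W)
with torch.no_grad():
    pred = model([F.convert_image_dtype(img, torch.float)])[0]

keep = pred["scores"] > 0.7                   # illustrative confidence threshold
masks = pred["masks"][keep].squeeze(1) > 0.5  # soft (N, 1, H, W) masks -> boolean
overlay = draw_segmentation_masks(img, masks, alpha=0.6)
F.to_pil_image(overlay).save("qualitative_sample.png")
```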
- An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Mask R-CNN
- Benchmarking Detection Transfer Learning with Vision Transformers
- TorchVision's Mask R-CNN
- TorchVision Object Detection Finetuning Tutorial
- Penn-Fudan Database for Pedestrian Detection and Segmentation
- PyTorch Lightning