Example semantic segmentation of People vs Background using one of the included, real-time, architectures (running at 100FPS).
By Andres Milioto et.al @ University of Bonn.
In early 2018 we released Bonnet, which is a real-time, robotics oriented semantic segmentation framework using Convolutional Neural Networks (CNNs). Bonnet provides an easy pipeline to add architectures and datasets for semantic segmentation, in order to train and deploy CNNs on a robot. It contains a full training pipeline in Python using TensorFlow and OpenCV, and it also includes some C++ apps to deploy a CNN in ROS and standalone. The C++ library is designed so that other backends (such as TensorRT) can be added.
Back then, most of my research was in the field of semantic segmentation, so that is what the framework was tailored to do. Since then, we have found a way to make things even more awesome, allowing for a suite of other tasks, like classification, detection, instance and semantic segmentation, feature extraction, counting, etc. Hence the name of this new framework, "Bonnetal", which reflects that this is nothing but the old Bonnet, and then some. Hopefully, the explicit et.al. will also spawn further collaboration and many pull requests!
We've also switched to PyTorch to allow for easier mixing of backbones, decoders, and heads for different tasks. If you are still comfortable with just semantic segmentation, and/or you're a fan of TensorFlow, you can still find the original Bonnet here. Otherwise, keep on reading, and I'll try to explain why Bonnetal rules!
DISCLAIMER: I am currently bringing all the functionality out from a previously closed-source framework, so be patient if the task/weights are a placeholder, and send me an email to ask for a schedule on the particular part that you need.
This code provides a framework to mix and match popular, imagenet-trained backbones with different decoders to achieve different CNN-enabled tasks. All of the backbones come with pre-trained imagenet weights, which are downloaded by default if the conditions are met.
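The mix-and-match idea can be pictured as two lookup tables, one for backbones and one for task heads, joined by the number of feature channels the backbone produces. The sketch below is only an illustration of that pattern: `BACKBONES`, `HEADS`, `ModelSpec`, and `build_model` are hypothetical names invented for this example and are not Bonnetal's actual modules or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical registry of backbones (illustration only, not Bonnetal's API).
# Each entry records how many feature channels the backbone outputs.
BACKBONES: Dict[str, Callable[[], dict]] = {
    "resnet18": lambda: {"name": "resnet18", "out_channels": 512},
    "mobilenetv2": lambda: {"name": "mobilenetv2", "out_channels": 1280},
}

# Hypothetical registry of task heads. Each factory takes the backbone's
# channel count and the number of classes for the task.
HEADS: Dict[str, Callable[[int, int], dict]] = {
    "classification": lambda in_ch, n: {"type": "linear", "in": in_ch, "out": n},
    "segmentation": lambda in_ch, n: {"type": "conv1x1", "in": in_ch, "out": n},
}


@dataclass
class ModelSpec:
    """Description of a backbone paired with a task-specific head."""
    backbone: dict
    head: dict


def build_model(backbone: str, task: str, n_classes: int) -> ModelSpec:
    """Pair any registered backbone with any registered task head."""
    if backbone not in BACKBONES:
        raise KeyError(f"unknown backbone: {backbone}")
    if task not in HEADS:
        raise KeyError(f"unknown task: {task}")
    bb = BACKBONES[backbone]()
    # The head's input size is taken from the backbone, so any pairing fits.
    head = HEADS[task](bb["out_channels"], n_classes)
    return ModelSpec(backbone=bb, head=head)


spec = build_model("resnet18", "segmentation", n_classes=2)
print(spec)
```

Because the head reads its input size from the backbone entry, swapping `"resnet18"` for `"mobilenetv2"` needs no other change in this sketch.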
- Backbones included are (so far):
The main reason for the "lack" of variety of backbones so far is that imagenet pre-training takes a while, and it is pretty resource intensive. If you want a new backbone implemented we can talk about it, and you can share your resources to pretrain it (PRs welcome).
- Tasks included are:
The code is (like the original Bonnet) separated into a training part, developed in Python using PyTorch, and a deployment/inference part, which is written entirely in C++ and contains the code to run on the robot, either using ROS or standalone.
An nvidia-docker container is provided to run the full framework. It also serves as a dependency check and is used for continuous integration. You can find the instructions to run the containers in /docker.
/train contains Python code to easily mix and match backbones and decoders in order to train them for different image recognition tasks. It also contains helper scripts for other tasks such as converting graphs to ONNX for inference, getting image statistics for normalization, class statistics in the dataset, inference tests, accuracy assessment, etc.
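As an illustration of what "image statistics for normalization" usually means, the sketch below computes the per-channel mean and standard deviation over a set of images using running sums. It is a minimal pure-Python example, not the helper script shipped in /train: the function name `channel_stats`, the nested-list image layout (rows of pixels, each pixel a tuple of channel values), and the 0-255 value range are all assumptions made for this example.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Pixel = Sequence[float]
Image = Sequence[Sequence[Pixel]]  # rows of pixels, each pixel = channel values


def channel_stats(images: Iterable[Image],
                  n_channels: int = 3) -> Tuple[List[float], List[float]]:
    """Per-channel mean and (population) std of pixel values scaled to [0, 1]."""
    total = [0.0] * n_channels
    total_sq = [0.0] * n_channels
    count = 0
    for img in images:
        for row in img:
            for px in row:
                for c in range(n_channels):
                    v = px[c] / 255.0  # assumes 8-bit pixel values
                    total[c] += v
                    total_sq[c] += v * v
                count += 1
    if count == 0:
        raise ValueError("no pixels found")
    means = [s / count for s in total]
    # var = E[x^2] - E[x]^2, clamped at 0 to absorb rounding error
    stds = [math.sqrt(max(sq / count - m * m, 0.0))
            for sq, m in zip(total_sq, means)]
    return means, stds


# One tiny 1x2 image: a black pixel and a white pixel.
means, stds = channel_stats([[[(0, 0, 0), (255, 255, 255)]]])
print(means, stds)
```

The resulting means and standard deviations are the values typically subtracted from and divided into each channel before an image is fed to the network.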
/deploy contains C++ code for deployment on edge. Every task has its own library and namespace, and every package is a catkin package. Therefore, each task has 4 catkin packages:
- A lib package that contains all inference files for the library.
- A standalone package that shows how to use the library linked to a standalone C++ application.
- A ros package that contains a node handler and some nodes to use the library with ROS for the sensor data message-passing, and
- (optionally) a msg package that defines the messages required for a specific task, should this be required.
Inference is done either:
- By generating a PyTorch traced model through the Python interface, which can be run with the libtorch library on both GPU and CPU, or
- By generating an ONNX model through the Python interface, which is later picked up by TensorRT, profiled on the individual computer according to its available memory and half-precision capabilities, and run with the TensorRT engine. Note that not all architectures are supported by TensorRT, and we cannot take responsibility for this. When you implement an architecture, do a quick test that it works with TensorRT before training it; it will make your life easier.
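The per-machine profiling step in the TensorRT path can be thought of as a small decision based on what the device offers. The sketch below illustrates that decision only; it does not call TensorRT, and `DeviceInfo`, `EnginePlan`, `plan_engine`, the 1 GiB workspace cap, and the 50% memory fraction are all hypothetical choices made for this example rather than values taken from the framework.

```python
from dataclasses import dataclass


@dataclass
class DeviceInfo:
    """What the profiling step is assumed to know about the machine."""
    free_memory_bytes: int
    supports_fp16: bool


@dataclass
class EnginePlan:
    precision: str        # "fp16" or "fp32"
    workspace_bytes: int  # scratch memory the engine builder may use


def plan_engine(device: DeviceInfo,
                max_workspace_bytes: int = 1 << 30,
                memory_fraction: float = 0.5) -> EnginePlan:
    """Choose precision and workspace size from device capabilities."""
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    # Use half precision only when the hardware supports it.
    precision = "fp16" if device.supports_fp16 else "fp32"
    # Never claim more than a fraction of the memory that is currently free.
    workspace = min(max_workspace_bytes,
                    int(device.free_memory_bytes * memory_fraction))
    return EnginePlan(precision=precision, workspace_bytes=workspace)


print(plan_engine(DeviceInfo(free_memory_bytes=4 << 30, supports_fp16=True)))
```

Keeping this decision separate from the engine build makes it easy to log or override on a particular robot.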
Imagenet pretrained weights for the backbones are downloaded directly into the backbones on first use, so they never start from scratch. Whenever you use a backbone for a task, if the image is RGB, then the weights from imagenet are downloaded into the backbone (unless a specific pretrained model is explicitly stated in the parameters).
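The rule described above can be written as a short decision function. This is a sketch under stated assumptions, not the framework's code: `weight_source` and its return strings are hypothetical, and the fallback to random initialization for non-RGB inputs is an assumption, since the text only describes the RGB case.

```python
from typing import Optional


def weight_source(n_channels: int, pretrained_path: Optional[str] = None) -> str:
    """Decide where a backbone's initial weights come from."""
    # An explicitly given pretrained model always takes priority.
    if pretrained_path is not None:
        return "pretrained:" + pretrained_path
    # RGB input matches the imagenet-trained first layer, so reuse its weights.
    if n_channels == 3:
        return "imagenet"
    # Assumption: other channel counts cannot reuse the RGB weights directly.
    return "random"


print(weight_source(3))                    # imagenet
print(weight_source(3, "/path/to/model"))  # pretrained:/path/to/model
print(weight_source(1))                    # random
```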
These are the currently trained models we have:
- Pretrained Backbones:
- Classification:
- Semantic Segmentation:
  - Persons (super fast, jetson benchmark):
  - Cityscapes:
  - Synthia:
  - Mapillary Vistas:
  - Pascal VOC 2012:
  - MS-COCO (Panoptic):
Copyright 2019, Andres Milioto, Cyrill Stachniss. University of Bonn.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Models pretrained on a specific dataset remain subject to the copyright of that dataset.
- Imagenet: Link
- Synthia: Link
- Cityscapes: Link
- Mapillary: Link
- Berkeley100k: Link
- ApolloScape: Link
- Persons: Link
- Coco: Link
- Pascal: Link
- Crop-Weed (CWC): Link
If you use our framework for any academic work, please cite the original paper.
@InProceedings{milioto2019icra,
author = {A. Milioto and C. Stachniss},
title = {{Bonnet: An Open-Source Training and Deployment Framework for Semantic Segmentation in Robotics using CNNs}},
booktitle = {Proc. of the IEEE Intl. Conf. on Robotics \& Automation (ICRA)},
year = 2019,
codeurl = {https://github.com/Photogrammetry-Robotics-Bonn/bonnet},
videourl = {https://www.youtube.com/watch?v=tfeFHCq6YJs},
}
If you use our Instance Segmentation code, please cite its paper:
@InProceedings{milioto2019icra-fiass,
author = {A. Milioto and L. Mandtler and C. Stachniss},
title = {{Fast Instance and Semantic Segmentation Exploiting Local Connectivity, Metric Learning, and One-Shot Detection for Robotics}},
booktitle = {Proc. of the IEEE Intl. Conf. on Robotics \& Automation (ICRA)},
year = 2019,
}
Our networks are either built directly on top of, or strongly based on, the following architectures, so if you use them for any academic work, please take a look at their papers and cite them where appropriate:
- ResNet: Link
- DarkNet: Link
- YoloV3: Link
- MobileNetsV2: Link
- SegNet: Link
- E-Net: Link
- ERFNet: Link
- PSPNet: Link
- DeeplabV3: Link
- Sync Batchnorm: Allows training bigger nets in a multi-GPU setup with larger batch sizes, so that the batch norm statistics don't diverge to something that doesn't represent the data.
- Queueing tool: Very nice queueing tool to share GPU, CPU and Memory resources in a multi-GPU environment.
- PyTorch: The backbone of everything.
- onnx-tensorrt: ONNX graph to TensorRT engine for fast inference.
- nvidia-docker: Docker that allows you to also exploit your nvidia GPU.
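To see why the Sync Batchnorm entry in the list above matters, compare the statistics each GPU would compute from its own slice of the batch with the statistics of the whole batch. The sketch below is a toy pure-Python illustration of the idea (combining per-device sums and counts), not the implementation of the tool mentioned above; `local_stats` and `synced_stats` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def local_stats(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and population variance of one device's slice of the batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var


def synced_stats(per_device: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Mean and variance of the full batch from per-device sums and counts."""
    n = sum(len(v) for v in per_device)
    total = sum(sum(v) for v in per_device)
    total_sq = sum(x * x for v in per_device for x in v)
    mean = total / n
    var = total_sq / n - mean * mean  # E[x^2] - E[x]^2
    return mean, var


# Two devices, each seeing a small and unrepresentative slice of the batch.
slices: List[List[float]] = [[0.0, 2.0], [10.0, 12.0]]
for i, s in enumerate(slices):
    print("device", i, local_stats(s))
print("synced", synced_stats(slices))
```

Each device alone sees a variance of 1 around very different means, while the synchronized statistics describe the whole batch, which is what normalization should reflect.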
- Andres Milioto
- Ignacio Vizzo
- Leonard Mandtler
- Jens Behley
- Cyrill Stachniss
This work has partly been supported by the German Research Foundation under Germany's Excellence Strategy, EXC-2070 - 390732324 (PhenoRob). We also thank NVIDIA Corporation for providing a Quadro P6000 GPU partially used to develop this framework.