Replication package for the paper "Boundary State Generation for Testing and Improvement of Autonomous Driving Systems"
Artifacts accompanying the paper "Boundary State Generation for Testing and Improvement of Autonomous Driving Systems" submitted for publication at Transaction on Software Engineering.
Our tool GenBo mutates the initial driving conditions of the ego vehicle to find boundary state pairs in a failure-free environment scenario. In the next step, GenBo uses such boundary state pairs to improve the driving model. The figure below shows the two steps of the approach.
The raw data, the models, and the driving simulators are stored here.
-
Install anaconda for your system;
-
Create the environment:
conda create -n genbo python=3.8
-
Activate the environment:
conda activate genbo
-
Downgrade the version of
setuptools
andwheel
to be compatible with thegym==0.21.0
library:pip install setuptools==65.5.0 "wheel<0.40.0"
-
Downgrade the version of
pip
to enable a correct installation of thegym==0.21.0
library:pip install "pip<24.2"
-
Install the requirements:
pip install -r requirements.txt
Download the simulator from here.
Once the libraries are installed, and the simulator downloaded type:
# Linux desktop
python donkey_test.py --donkey-exe-path ~/donkey-genbo-linux/donkey_sim.x86_64 \
--donkey-scene-name generated_track --track-num 0
# Mac desktop
python donkey_test.py --donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app \
--donkey-scene-name generated_track --track-num 0
The program should print an image like the following on logs/donkey_test.png
:
Assuming that the commands below will be executed on a Mac (CPU: 2.4 GHz 8-Core Intel Core i9, GPU: Intel UHD Graphics 630 1536 MB, RAM: 64 GB 2667 MHz DDR4).
The following command search for boundary state pairs (called frontier pairs in the code) of the weakest model donkey-dave2-m1.h5
(the search is executed on the training track track_0
, which is called T1
in the paper). The length-exponential-factor
variable in the code corresponds to the parameter L
in the paper.
cd scripts
conda activate genbo
bash -i search.sh --model donkey-dave2-m1.h5 \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app \
--num-iterations 10 --num-restarts 2 --seed 0
It should find, on average, one boundary state pair in about 15 minutes.
The following command search for boundary state pairs of the second weakest model donkey-dave2-m2.h5
.
cd scripts
conda activate genbo
bash -i search.sh --model donkey-dave2-m2.h5 \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app \
--num-iterations 10 --num-restarts 10 --seed 0
It should find, on average, one boundary state pair in about 60 minutes (note that the number of restarts increased from 2
to 10
since the model is stronger).
The data for the first step of the search are here.
To run the recovery procedure type:
cd scripts
conda activate genbo
bash -i replay.sh --model-to-test donkey-dave2-m2.h5 \
--model-tested donkey-dave2-m1.h5 --execute-failure-member false \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
bash -i replay.sh --model-to-test donkey-dave2-m2.h5 \
--model-tested donkey-dave2-m1.h5 --execute-failure-member true \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
to execute the m2
model on the boundary state pairs of the m1
model (first on the non-failure members of the pairs, then on the failure members); the recovery percentages should be respectively 100% and 80/100%. Add the --video
option to save videos of the recovery (each individual of the pair is executed five times).
Likewise run the following commands:
cd scripts
conda activate genbo
bash -i replay.sh --model-to-test donkey-dave2-m1.h5 \
--model-tested donkey-dave2-m2.h5 --execute-failure-member false \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
bash -i replay.sh --model-to-test donkey-dave2-m1.h5 \
--model-tested donkey-dave2-m2.h5 --execute-failure-member true \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
to execute the m1
model on the boundary state pairs of the m2
model (first on the non-failure members of the pairs, then on the failure members); the recovery percentages should be respectively 40% and 0%.
Now suppose that we want to improve the second model m2
.
First we need to run the autopilot on the boundary state pairs of m2
and use it to label the images on those boundary conditions. Since the m2
struggles on the failure members of the pairs, we execute the autopilot only there.
cd scripts
conda activate genbo
bash -i replay.sh --model-to-test autopilot \
--model-tested donkey-dave2-m2.h5 --execute-failure-member true \
--collect-images true --donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
The second step is to retrain the model on the newly created dataset (assuming it is named donkey-2024_06_20_16_50_58-finetuning-agent-dave2-m2-generated_track-0*.npz
, placed under logs
) merged with the original dataset (that you can download from here, it is under dataset
; the original dataset should be placed under logs
). Type:
cd scripts
conda activate genbo
bash -i ./retrain_models.sh --finetuning-archive donkey-2024_06_20_16_50_58-finetuning-agent-dave2-m2-generated_track-0 \
--original-archive donkey-archive-agent-autopilot-episodes-6-generated_track-0 \
--model-name-suffix donkey-dave2-m2
to retrain the m2
model. On CPU it takes a long time (on a MacBook with the specs described above it takes around 4 hours). if you have an NVIDIA GPU and docker with the NVIDIA container toolkit installed, you can use the Dockerfile
in this repository to build a Docker container and run the previous command within it:
docker build -t genbo:dev .
docker run --rm --gpus all -u ${UID} -it --mount type=bind,source="$(pwd)",target=/home/genbo --workdir /home/genbo --name bostage-container genbo:dev bash
cd scripts
bash -i ./retrain_models.sh --finetuning-archive donkey-2024_06_20_16_50_58-finetuning-agent-dave2-m2-generated_track-0 \
--original-archive donkey-archive-agent-autopilot-episodes-6-generated_track-0 \
--model-name-suffix donkey-dave2-m2
exit
On an Ubuntu machine with a Quadro RTX 4000 with 8GB of VRAM the retraining takes around 1 hour. The command redirects the output to logs/models/donkey-dave2-m2-run-1.txt
; the model is saved in logs/models/donkey-dave2-m2-run-1
, while the plot of the training and validation loss is placed in logs/models/donkey-dave-2-run-1.pdf
. The total number of epochs is set to 1k, but the training stops once the validation loss does not improve for more than 10 epochs.
The donkey-archive-agent-autopilot-episodes-6-generated_track-0
file contains the images and labels of six episodes of the autopilot driving on track_0
, and it is used to train the dave2 models. Optionally, it can be obtained using the following command (it takes around 1 minute):
cd scripts
conda activate genbo
bash -i ./evaluate.sh --track-num 0 --num-episodes 6 --max-steps 1400 \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
The last step is to evaluate the original and retrained models on eight evaluation tracks (plus the original training track, i.e., track_0
). We first evaluate the original m2
model on all the tracks as follow:
cd scripts
conda activate genbo
bash -i ./evaluate.sh --model donkey-dave2-m2.h5 --num-episodes 5 --max-steps 200 \
--donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
The script runs the model on nine tracks, including the original, for five episodes of maximum 200 steps (i.e., the model succeeds in an episode if it runs for 200 steps). The default is num_episodes=100
and max_steps=1000
. This needs to be executed only once when there are multiple retrainings. We then run the retrained model on all the tracks as follow:
cd scripts
conda activate genbo
bash -i ./evaluate.sh --model donkey-dave2-m2-run-1 --num-episodes 5 \
--max-steps 200 --num-run 1 --donkey-exe-path ~/donkey-genbo-mac/donkey_sim.app
Since we have only one run, i.e., one search run producing the boundary state pairs, we use num_run=1
. The command needs to be repeated for each retraining run.
The last step consists of processing the evaluation logs as follow:
cd scripts
# process logs of original model
./process_evaluation_logs.sh --original true --model donkey-dave2-m2
# process logs of retrained models
./process_evaluation_logs.sh --model donkey-dave2-m2
The second command looks for all retrained models stemming from donkey-dave2-m2
, i.e., donkey-dave2-m2-run-{i}
. In this case there is only one retrained model of donkey-dave2-m2
, i.e., donkey-dave2-m2-run-1
. The two commands print the success rates on the original and the seven evaluation tracks, sorted in this way. It should print the following:
------------ donkey-dave2-m2 -----------
DEBUG:collect_images:Success rate: 1.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
DEBUG:collect_images:Success rate: 0.00
------------ donkey-dave2-m2-run-1 -----------
DEBUG:collect_images:Success rate: 1.00
DEBUG:collect_images:Success rate: 0.40
DEBUG:collect_images:Success rate: 0.80
DEBUG:collect_images:Success rate: 0.40
DEBUG:collect_images:Success rate: 0.40
DEBUG:collect_images:Success rate: 0.20
DEBUG:collect_images:Success rate: 0.40
DEBUG:collect_images:Success rate: 0.20
DEBUG:collect_images:Success rate: 0.00
To cite this repository in publications:
@article{DBLP:journals/tse/BiagiolaT24,
author = {Matteo Biagiola and
Paolo Tonella},
title = {Boundary State Generation for Testing and Improvement of Autonomous
Driving Systems},
journal = {{IEEE} Trans. Software Eng.},
volume = {50},
number = {8},
pages = {2040--2053},
year = {2024},
url = {https://doi.org/10.1109/TSE.2024.3420816},
doi = {10.1109/TSE.2024.3420816},
timestamp = {Sun, 08 Sep 2024 16:06:26 +0200},
biburl = {https://dblp.org/rec/journals/tse/BiagiolaT24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}