Source code for *Safe Reinforcement Learning Using Advantage-Based Intervention* (Wagener, Boots, and Cheng, ICML 2021). The code is based on OpenAI's Spinning Up.
- Except for the notes below, follow the installation instructions here: https://spinningup.openai.com/en/latest/user/installation.html
- Replace `conda create -n spinningup python=3.6` and `conda activate spinningup` with `conda create -n safe python=3.6` and `conda activate safe`, respectively.
- Instead of the "Installing Spinning Up" section, run the following:
  ```
  git clone https://github.com/nolanwagener/safe_rl.git
  cd safe_rl
  pip install -e .
  ```
- Also follow the instructions in "Installing MuJoCo".
- Go to the `extra_envs` directory and install it to expose those environments to Gym:
  ```
  cd extra_envs
  pip install -e .
  ```
- Install `mjrl` and `mjmpc`:
  ```
  git clone https://github.com/mohakbhardwaj/mjmpc.git -b nolan/safe_rl
  pip install git+https://github.com/aravindr93/mjrl.git@master#egg=mjrl
  cd mjmpc
  pip install -e .
  ```
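After installation, a quick sanity check is to import the installed packages. The module names below are assumptions based on the directory layout described later in this README, and importing `extra_envs` is assumed to register its environments with Gym:

```python
# Minimal post-install import check. The module names are assumptions based
# on the repository layout; adjust them if the installed packages differ.
import gym          # noqa: F401
import safe_rl      # noqa: F401
import extra_envs   # noqa: F401  # assumed to register the point and Half-Cheetah envs

print("Imports succeeded")
```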
- `extra_envs`: Consists of three folders:
  - `envs`: The point and Half-Cheetah environments.
  - `wrappers`: Defines the intervention wrapper. Whenever we `step` the wrapped environment, we check whether the proposed action requires intervention. If not, we `step` the internal environment. Otherwise, we return a NaN observation and set the `done` flag to `True`. If the intervener gives a safe action, the returned `info` dictionary also includes the `step` output `(o, r, d, info)` obtained when the safe action from the intervener is applied to the internal environment. (A minimal sketch of this step logic appears after this list.)
  - `intervener`: Intervention rules G = (Q, μ, η) for the point and Half-Cheetah environments. Each rule has a `should_intervene` method, which uses a heuristic or an advantage-based rule to decide whether a given action requires intervention, and a `safe_action` method, which returns a safe action from μ.
    - Point: The safe policy μ is a deceleration policy. We have two interveners:
      - `PointIntervenerNetwork`: Uses value and Q-value approximators (loaded with PyTorch) to build the advantage-function estimate.
      - `PointIntervenerRollout`: Rolls out the deceleration policy on a model of the environment to build the advantage-function estimate.
    - Half-Cheetah: We have two interveners:
      - `HalfCheetahHeuristicIntervener`: Simply checks whether the predicted next state would result in a constraint violation. The episode terminates immediately upon intervention.
      - `HalfCheetahMpcIntervener`: Uses a modeled environment and sampling-based MPC to form the safe policy, and likewise builds an advantage-function estimate using MPC. Upon intervention, it can either reset the environment or return an action from MPC.
- `safe_rl/algos`:
  - `cppo`: The PPO algorithm modified for the constrained/safe setting. Our implementation maintains one value function for the reward and another value function that predicts the constraint cost or an intervention (here, overloaded into the same scalar). CPPO can therefore be used both in the constrained setting, where a Lagrange multiplier is optimized, and in the unconstrained safe setting, where a fixed penalty is received for an intervention. (One possible way of combining the two advantage estimates is sketched after this list.)
  - `csc`: The constrained PPO algorithm, but with a state-action critic used for the cost in place of the state critic. The state-action "safety" critic is used to filter out unsafe proposed actions and is trained in a conservative fashion to make the agent safer.
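To make the wrapper and intervener descriptions above concrete, here is a minimal sketch, not the repository's actual implementation, of how an intervener interface and the wrapper's `step` logic could fit together. The class names, the `info` keys (`intervened`, `safe_step`), the zero reward on intervention, and the sign convention of the advantage test are illustrative assumptions; only the overall behavior (NaN observation, `done=True`, and passing the safe-action `step` output through `info`) comes from the description above.

```python
import gym
import numpy as np


class Intervener:
    """Minimal intervener interface matching the description above."""

    def should_intervene(self, obs, action):
        raise NotImplementedError

    def safe_action(self, obs):
        raise NotImplementedError


class AdvantageIntervener(Intervener):
    """Advantage-based rule G = (Q, mu, eta): intervene when the estimated
    advantage of the proposed action relative to the safe policy's value
    crosses the threshold eta. The exact form and sign convention used in
    the paper may differ; this is only an illustration."""

    def __init__(self, q_fn, v_fn, safe_policy, eta):
        self.q_fn, self.v_fn = q_fn, v_fn  # e.g., networks or model rollouts
        self.safe_policy, self.eta = safe_policy, eta

    def should_intervene(self, obs, action):
        return self.q_fn(obs, action) - self.v_fn(obs) > self.eta

    def safe_action(self, obs):
        return self.safe_policy(obs)


class InterventionWrapper(gym.Wrapper):
    """Sketch of the step logic described for the `wrappers` folder above."""

    def __init__(self, env, intervener, use_safe_action=True):
        super().__init__(env)
        self.intervener = intervener
        self.use_safe_action = use_safe_action
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        if not self.intervener.should_intervene(self._last_obs, action):
            # No intervention: step the internal environment as usual.
            self._last_obs, rew, done, info = self.env.step(action)
            return self._last_obs, rew, done, info

        info = {'intervened': True}
        if self.use_safe_action:
            # Apply the intervener's safe action to the internal environment
            # and pass that (o, r, d, info) transition back through `info`.
            safe_a = self.intervener.safe_action(self._last_obs)
            info['safe_step'] = self.env.step(safe_a)

        # Return a NaN observation and end the episode; the zero reward here
        # is an assumption, not taken from the description above.
        nan_obs = np.full(self.observation_space.shape, np.nan,
                          dtype=self.observation_space.dtype)
        return nan_obs, 0.0, True, info
```

Likewise, for `cppo`, the following is a rough sketch, under the assumption of a standard PPO-style clipped surrogate on a penalized advantage, of how the reward and cost advantage estimates might be combined; the repository's actual objective and coefficient handling may differ.

```python
import torch


def cppo_policy_loss(ratio, adv_r, adv_c, clip_ratio=0.2, penalty=1.0):
    """Clipped PPO surrogate on a penalized advantage (illustrative only).

    ratio:   pi_new(a|s) / pi_old(a|s) for the sampled actions
    adv_r:   advantage estimates from the reward value function
    adv_c:   advantage estimates from the cost/intervention value function
    penalty: a fixed intervention penalty (unconstrained safe setting) or a
             Lagrange multiplier that is itself optimized (constrained setting)
    """
    adv = adv_r - penalty * adv_c  # assumed way of combining the two critics
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * adv
    return -torch.min(ratio * adv, clipped).mean()  # negate to minimize
```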
All training scripts are located in `/scripts/`, contained in two `script.sh` files. After running the scripts, the results are stored in `/data`.
Results can be plotted using the Spinning Up plotting tool (with `spinup.run` replaced with `safe_rl.run`): https://spinningup.openai.com/en/latest/user/plotting.html
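For example, mirroring Spinning Up's documented plotting usage and assuming results were saved under `/data`, a plotting call would look something like `python -m safe_rl.run plot data/<experiment_name>`, where the experiment directory name is a placeholder.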
If you use this repo in your research, please cite:
```
@inproceedings{wagener2021safe,
  title={{Safe Reinforcement Learning Using Advantage-Based Intervention}},
  author={Wagener, Nolan and Boots, Byron and Cheng, Ching-An},
  booktitle={International Conference on Machine Learning},
  pages={10630--10640},
  year={2021},
  organization={PMLR}
}
```