This repository contains re-implementations of Deep RL algorithms for continuous action spaces. Some highlights:
- Code is readable, and written to be easy to modify for future research. Many popular Deep RL frameworks are highly modular, which can make it confusing to identify the changes in a new method. Aside from universal components like the replay buffer, network architectures, etc., each implementation in this repo is contained in a single file.
- Train and test on different environments (for generalization research).
- Built-in TensorBoard logging and parameter saving.
- Support for offline (batch) RL.
- Quick setup for benchmarks like Gym MuJoCo, Atari, PyBullet, and the DeepMind Control Suite.
Paper: Continuous control with deep reinforcement learning, Lillicrap et al., 2015.
Description: A baseline model-free, off-policy actor-critic method that forms the template for many of the other algorithms here.
Code: deep_control.ddpg (with extra comments for an intro to deep actor-critics)
Examples: examples/basic_control/ddpg_gym.py
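The snippet below is a minimal sketch of the two update rules DDPG relies on. It computes the critic's bootstrapped target from a deterministic target actor and target critic, then applies a Polyak (soft) target-network update. Plain Python floats and callables stand in for the networks. The function names and the tau = 0.005 default are illustrative assumptions, not the API of deep_control.ddpg.

```python
from typing import Callable, List


def soft_update(target: List[float], online: List[float], tau: float = 0.005) -> List[float]:
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


def ddpg_critic_target(
    reward: float,
    done: bool,
    next_state: float,
    target_actor: Callable[[float], float],
    target_critic: Callable[[float, float], float],
    gamma: float = 0.99,
) -> float:
    """Bellman target y = r + gamma * (1 - done) * Q'(s', mu'(s'))."""
    next_action = target_actor(next_state)  # deterministic target policy mu'
    bootstrap = target_critic(next_state, next_action)
    # Terminal transitions do not bootstrap from the next state
    return reward + gamma * (1.0 - float(done)) * bootstrap


# Toy stand-ins for the target networks (hypothetical, for illustration only)
y = ddpg_critic_target(1.0, False, 2.0, lambda s: 0.5 * s, lambda s, a: s + a, gamma=0.9)
print(round(y, 6))  # 3.7
```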
Paper: Addressing Function Approximation Error in Actor-Critic Methods, Fujimoto et al., 2018.
Description: Builds off of DDPG and makes several changes to improve the critic's learning and performance (Clipped Double Q Learning, Target Smoothing, Actor Delay). Also includes the TD regularization term from "TD-Regularized Actor-Critic Methods."
Code: deep_control.td3
Examples: examples/basic_control/td3_gym.py
Other References: author's implementation
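Most of TD3's changes show up in how the critic target is built. The scalar sketch below illustrates three of them:

- Target policy smoothing adds clipped Gaussian noise to the target action.
- Clipped double Q-learning bootstraps from the smaller of two target critics.
- The actor update is delayed.

The defaults (policy_noise=0.2, noise_clip=0.5, policy_delay=2) follow the TD3 paper. The function names are hypothetical, the repo's own defaults may differ, and the TD regularization term is not shown.

```python
import random
from typing import Callable, Optional


def td3_target(
    reward: float,
    done: bool,
    next_state: float,
    target_actor: Callable[[float], float],
    target_q1: Callable[[float, float], float],
    target_q2: Callable[[float, float], float],
    gamma: float = 0.99,
    policy_noise: float = 0.2,
    noise_clip: float = 0.5,
    max_action: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    rng = rng or random.Random()
    # Target policy smoothing: perturb the target action with clipped Gaussian noise
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, policy_noise)))
    next_action = max(-max_action, min(max_action, target_actor(next_state) + noise))
    # Clipped double Q-learning: bootstrap from the smaller of the two target critics
    next_q = min(target_q1(next_state, next_action), target_q2(next_state, next_action))
    return reward + gamma * (1.0 - float(done)) * next_q


def update_actor_this_step(step: int, policy_delay: int = 2) -> bool:
    """Delayed updates: actor and target networks update once every `policy_delay` critic steps."""
    return step % policy_delay == 0
```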
Paper: Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, Haarnoja et al., 2018.
Description: Samples actions from a stochastic actor rather than relying on added exploration noise during training. Uses a TD3-like double critic system. We implement the learnable entropy coefficient approach described in the follow-up paper. This version also supports the self-regularized critic updates from GRAC (see below).
Code: deep_control.sac
Examples: examples/dmc/sac.py, examples/sacd_demo.py
Other References: Yarats and Kostrikov's implementation, author's implementation.
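This is a scalar sketch of two SAC-specific quantities. The first is the entropy-regularized (soft) critic target, which reuses TD3's min over two critics. The second is a gradient step on the learnable entropy coefficient, parameterized as log_alpha. target_entropy is commonly set to -action_dim, following the follow-up paper. The function names and learning rate are assumptions for illustration.

```python
def sac_critic_target(
    reward: float,
    done: bool,
    next_q1: float,
    next_q2: float,
    next_log_prob: float,
    alpha: float,
    gamma: float = 0.99,
) -> float:
    """y = r + gamma * (1 - d) * (min(Q1', Q2') - alpha * log pi(a'|s')), with a' ~ pi(.|s')."""
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - float(done)) * soft_value


def alpha_step(log_alpha: float, log_prob: float, target_entropy: float, lr: float = 3e-4) -> float:
    """One gradient-descent step on J(log_alpha) = -log_alpha * (log_prob + target_entropy).

    log_prob is treated as a constant (detached). If the policy's entropy (-log_prob)
    is below target_entropy, the gradient is negative and alpha grows.
    """
    grad = -(log_prob + target_entropy)
    return log_alpha - lr * grad
```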
Paper: Measuring Visual Generalization in Continuous Control from Pixels, Grigsby and Qi, 2020
Description: This is a pixel-specific version of SAC with a few tricks and hyperparameter settings to improve performance. We include many different data augmentation techniques, including those used in RAD, DrQ, and Network Randomization. The DrQ augmentation is turned on by default and has a huge impact on performance.
Please Note: If you are interested in control from images, these features are implemented much more thoroughly in another repo: jakegrigsby/super_sac
Code: deep_control.sac_aug
Examples: examples/dmcr/sac_aug.py
Other References: SAC+AE code, RAD Procgen code, DrQ
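DrQ's augmentation is a random shift. The frame is padded by replicating its border pixels, then cropped back to the original size at a random offset. The sketch below applies this to a single image stored as a nested list. pad=4 mirrors the setting commonly used with 84x84 frames. The function name is hypothetical, and a practical implementation would operate on batched tensors instead.

```python
import random
from typing import List, Optional

Image = List[List[float]]


def random_shift(img: Image, pad: int = 4, rng: Optional[random.Random] = None) -> Image:
    """Pad an HxW image by replicating border pixels, then take a random HxW crop."""
    rng = rng or random.Random()
    h, w = len(img), len(img[0])
    # Replicate padding: indices outside the image are clamped to the nearest edge
    padded = [
        [img[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)] for j in range(-pad, w + pad)]
        for i in range(-pad, h + pad)
    ]
    # A random crop of the original size shifts the image by up to `pad` pixels
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]
```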
Paper: GRAC: Self-Regularized Actor-Critic, Shao et al., 2020.
Description: GRAC combines a stochastic policy with TD3-like stability improvements and CEM-based action selection like you'd see in QT-Opt or CAQL.
Code: deep_control.grac
Examples: examples/dmc/grac.py
Other References: author's implementation
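CEM-based action selection searches for an action that maximizes the learned Q function. It repeatedly samples candidates, keeps the top-scoring elites, and refits a Gaussian to them. The sketch below does this for a one-dimensional action. init_mean and init_std stand in for a starting distribution such as the actor's Gaussian. The population sizes are illustrative assumptions rather than GRAC's settings.

```python
import random
import statistics
from typing import Callable, Optional


def cem_argmax(
    q_fn: Callable[[float], float],
    init_mean: float,
    init_std: float,
    iters: int = 4,
    pop_size: int = 64,
    n_elite: int = 6,
    low: float = -1.0,
    high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Cross-entropy method search for argmax_a Q(s, a) over a 1-D action."""
    rng = rng or random.Random()
    mean, std = init_mean, init_std
    for _ in range(iters):
        # Sample candidate actions from the current Gaussian, clipped to the action bounds
        samples = [min(high, max(low, rng.gauss(mean, std))) for _ in range(pop_size)]
        # Keep the highest-scoring candidates and refit the Gaussian to them
        elites = sorted(samples, key=q_fn, reverse=True)[:n_elite]
        mean = statistics.mean(elites)
        std = statistics.pstdev(elites) + 1e-6  # small floor keeps sampling non-degenerate
    return mean
```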
Paper: Randomized Ensembled Double Q-Learning: Learning Fast Without a Model, Chen et al., 2021.
Description: Extends the double Q trick to random subsets of a larger critic ensemble. The reduced Q-function bias allows for a much higher replay ratio. REDQ is sample efficient but slow in wall-clock time compared to other model-free methods, because it performs many critic updates per environment step. We implement the SAC version.
Code: deep_control.redq
Examples: examples/dmc/redq.py
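The core REDQ change is in the target. Instead of the min over two critics, it takes the min over a random subset of M critics drawn from an ensemble of N. A scalar sketch of the SAC-style target follows. The REDQ paper's defaults are N = 10 and M = 2; the function name is hypothetical.

```python
import random
from typing import List, Optional


def redq_target(
    reward: float,
    done: bool,
    next_qs: List[float],
    next_log_prob: float,
    alpha: float,
    m: int = 2,
    gamma: float = 0.99,
    rng: Optional[random.Random] = None,
) -> float:
    """Soft Bellman target using the min over a random size-m subset of an N-critic ensemble."""
    rng = rng or random.Random()
    subset = rng.sample(next_qs, m)  # a fresh random subset each time the target is computed
    soft_value = min(subset) - alpha * next_log_prob
    return reward + gamma * (1.0 - float(done)) * soft_value
```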
Paper: DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction, Kumar et al., 2020.
Description: Reduces the effect of inaccurate target values propagating through the Q-function by learning to estimate the target networks' inaccuracies and reweighting the TD error accordingly. Implemented on top of standard SAC.
Code: deep_control.discor
Examples: examples/dmc/discor.py
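DisCor trains an extra error network Delta(s, a) to estimate the accumulated error in the target values. It then down-weights transitions whose targets are likely to be inaccurate. The sketch below shows the recursive error target and the resulting batch weights, normalized to mean 1. A fixed temperature is used here for simplicity. The names and defaults are illustrative, and this sketch does not show how deep_control.discor sets the temperature.

```python
import math
from typing import List


def discor_weights(
    next_errors: List[float],
    dones: List[bool],
    gamma: float = 0.99,
    temperature: float = 10.0,
) -> List[float]:
    """Weights w_i proportional to exp(-gamma * (1 - d_i) * Delta(s'_i, a'_i) / temperature).

    Weights are normalized to have mean 1 over the batch.
    """
    logits = [-gamma * (1.0 - float(d)) * e / temperature for e, d in zip(next_errors, dones)]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [len(exps) * x / total for x in exps]


def error_target(abs_td_error: float, next_error: float, done: bool, gamma: float = 0.99) -> float:
    """Recursive target for the error network: |Q(s, a) - y| + gamma * (1 - d) * Delta(s', a')."""
    return abs_td_error + gamma * (1.0 - float(done)) * next_error
```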
Paper: SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning, Lee et al., 2020.
Description: Extends SAC using an ensemble of actors and critics. Adds UCB-based exploration, ensembled inference, and a simpler weighted Bellman backup. This version does not use the replay buffer masks from the original.
Code: deep_control.sunrise
Examples: examples/dmc/sunrise.py
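Two of SUNRISE's additions can be sketched with plain floats:

- UCB exploration considers the candidate actions proposed by the ensemble's actors. It picks the one with the highest mean-plus-scaled-standard-deviation Q value across the critics.
- The weighted Bellman backup scales each transition's TD loss by sigmoid(-std(Q_target) * T) + 0.5, so targets the critics disagree on count for less.

The lam and temperature defaults below are assumptions.

```python
import math
import statistics
from typing import Callable, List, Sequence


def ucb_select(
    candidate_actions: Sequence[float],
    q_ensemble: List[Callable[[float], float]],
    lam: float = 1.0,
) -> float:
    """Pick the candidate maximizing mean(Q) + lam * std(Q) across the critic ensemble."""
    def ucb(a: float) -> float:
        qs = [q(a) for q in q_ensemble]
        return statistics.mean(qs) + lam * statistics.pstdev(qs)
    return max(candidate_actions, key=ucb)


def backup_weight(target_qs: List[float], temperature: float = 10.0) -> float:
    """Weighted Bellman backup: w = sigmoid(-std(Q_target) * temperature) + 0.5, in (0.5, 1.0]."""
    x = statistics.pstdev(target_qs) * temperature  # x >= 0
    return math.exp(-x) / (1.0 + math.exp(-x)) + 0.5  # sigmoid(-x) written to avoid overflow
```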
Description: A simple approach to offline RL that trains the actor network to emulate the action choices of the demonstration dataset. Uses the stochastic actor from SAC and some basic ensembling to make this a reasonable baseline.
Code: deep_control.sbc
Examples: examples/d4rl/sbc.py
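Behavioral cloning with a stochastic actor amounts to maximizing the likelihood of the dataset's actions. The sketch below computes the negative log-likelihood under a diagonal Gaussian. It ignores the tanh squashing a SAC-style actor applies. At inference time, it averages the ensemble members' mean actions. Averaging is one simple ensembling rule assumed here; it is not necessarily what deep_control.sbc does.

```python
import math
from typing import List


def gaussian_nll(action: List[float], mean: List[float], log_std: List[float]) -> float:
    """Negative log-likelihood of a dataset action under a diagonal Gaussian policy."""
    nll = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        z = (a - mu) / math.exp(ls)
        nll += 0.5 * z * z + ls + 0.5 * math.log(2.0 * math.pi)
    return nll


def ensemble_mean_action(member_means: List[List[float]]) -> List[float]:
    """Average the mean actions proposed by each ensemble member (an assumed ensembling rule)."""
    return [sum(dim) / len(dim) for dim in zip(*member_means)]
```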
Paper: Accelerating Online Reinforcement Learning with Offline Datasets, Nair et al., 2020, and Critic Regularized Regression, Wang et al., 2020.
Description: TD3 with a stochastic policy and a modified actor update that makes better use of offline experience before finetuning in the online environment. The current implementation is a mix between AWAC and CRR. We allow for online finetuning and use standard critic networks as in AWAC, but add the binary advantage function and the max/mean advantage estimates from CRR. The actor_per experience prioritization trick is discussed in A Closer Look at Advantage-Filtered Behavioral Cloning in High-Noise Datasets, Grigsby and Qi, 2021.
Code: deep_control.awac
Examples: examples/d4rl/awac.py
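Both AWAC and CRR train the actor with an advantage-weighted log-likelihood of dataset actions, roughly loss = -weight(A(s, a)) * log pi(a | s). The sketch below estimates the advantage using either the mean or the max of Q over actions sampled from the current policy. It then computes either CRR's binary filter weight or a clipped exponential weight. The names, beta, and the cap of 20 are illustrative assumptions.

```python
import math
import statistics
from typing import List


def advantage(q_data: float, q_policy_samples: List[float], estimator: str = "mean") -> float:
    """A(s, a) = Q(s, a) - V(s), with V(s) estimated from Q at actions sampled from the current policy."""
    if estimator == "mean":
        v = statistics.mean(q_policy_samples)
    elif estimator == "max":
        v = max(q_policy_samples)
    else:
        raise ValueError(f"unknown estimator: {estimator}")
    return q_data - v


def actor_weight(adv: float, kind: str = "binary", beta: float = 1.0, max_weight: float = 20.0) -> float:
    """Weight on log pi(a|s) for a dataset action: binary filter or clipped exponential."""
    if kind == "binary":
        return 1.0 if adv > 0 else 0.0  # imitate only actions that beat the current policy
    # Exponential advantage weighting, capped at max_weight (computed in log space to avoid overflow)
    return math.exp(min(adv / beta, math.log(max_weight)))
```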
Paper: Towards Automatic Actor-Critic Solutions to Continuous Control, Grigsby et al., 2021
Description: AAC uses a genetic algorithm to automatically tune the hyperparameters of SAC. A population of SAC agents is trained in parallel with a shared replay buffer and several design decisions that reduce hyperparameter sensitivity while (mostly) preserving sample efficiency. Please refer to the paper for more details. This is the official author implementation.
Code: deep_control.aac
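The sketch below is a generic illustration of a genetic-algorithm step over hyperparameters. It is not AAC's exact procedure, which is described in the paper. It ranks a population by fitness and replaces the worst fraction with mutated copies of the best members. The hyperparameter names, the 25% replacement fraction, and the log-space mutation are all assumptions.

```python
import math
import random
from typing import Dict, List, Optional

Hyperparams = Dict[str, float]


def evolve(
    population: List[Hyperparams],
    fitness: List[float],
    frac: float = 0.25,
    mutation_scale: float = 0.2,
    rng: Optional[random.Random] = None,
) -> List[Hyperparams]:
    """Replace the worst `frac` of the population with mutated copies of the best members."""
    rng = rng or random.Random()
    order = sorted(range(len(population)), key=lambda i: fitness[i], reverse=True)
    k = max(1, int(len(population) * frac))
    elite, worst = order[:k], order[-k:]
    new_population = [dict(p) for p in population]
    for i in worst:
        parent = population[rng.choice(elite)]
        # Multiplicative (log-space) mutation keeps positive hyperparameters positive
        new_population[i] = {
            name: value * math.exp(rng.gauss(0.0, mutation_scale)) for name, value in parent.items()
        }
    return new_population
```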
git clone https://github.com/jakegrigsby/deep_control.git
cd deep_control
pip install -e .
See the examples folder for a look at how to train agents in environments like the DeepMind Control Suite and OpenAI Gym.