deep-rl-paper-notes/pg/README.md at master · evancasey/deep-rl-paper-notes · GitHub

REINFORCE

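The REINFORCE policy gradient algorithm directly learns a parameterized policy by following a Monte Carlo estimate of the gradient of expected return (the likelihood-ratio, or score-function, estimator). With neural-network policies, the gradient of the log-probability of each sampled action is computed by back-propagation.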
Used to solve a variety of simple, low-dimensional discrete and continuous control tasks.
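A minimal sketch of the REINFORCE update θ ← θ + α G_t ∇θ log π_θ(a_t | s_t), assuming a one-step multi-armed bandit, a tabular softmax policy and no baseline. The bandit, learning rate and function names are illustrative assumptions, not taken from the papers below, and the gradient of the log-softmax is written out by hand instead of being computed by back-propagation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backwards over one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def grad_log_softmax(logits, action):
    """d/d(logits) of log pi(action) for a softmax policy: onehot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_bandit(true_means, episodes=2000, lr=0.1, seed=0):
    """Hypothetical example: REINFORCE on a one-step bandit with tabular logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(true_means)
    for _ in range(episodes):
        probs = softmax(logits)
        action = rng.choices(range(len(logits)), weights=probs)[0]
        reward = true_means[action] + rng.gauss(0.0, 0.1)
        # Each episode has length 1, so the return is just the reward.
        g = discounted_returns([reward], gamma=0.99)[0]
        grad = grad_log_softmax(logits, action)
        # Gradient ascent on expected return: theta += lr * G * grad log pi(a).
        logits = [l + lr * g * d for l, d in zip(logits, grad)]
    return softmax(logits)
```

For example, `reinforce_bandit([0.2, 1.0])` should end with most of the probability on the second (higher-mean) arm.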
Papers:
- Policy Gradient Methods for Reinforcement Learning with Function Approximation
- Reinforcement learning of motor skills with policy gradients
Blog posts:
- http://www.rage.net/~greg/2016-07-05-ActorCritic-with-OpenAI-Gym.html

Deep Deterministic Policy Gradients (DDPG)

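Like REINFORCE, DDPG directly learns a parameterized policy (the actor), but it also learns an action-value function (the critic). Unlike REINFORCE's stochastic policy gradient, it uses the deterministic policy gradient: the actor is updated by following the critic's gradient with respect to the action, chained through the policy.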
Applied to MuJoCo and other simulated continuous control tasks
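A minimal sketch of the deterministic policy gradient step, ∇θ J ≈ mean over states of ∇a Q(s, a) at a = μθ(s) times ∇θ μθ(s), plus the soft target-network update used by DDPG. To stay self-contained it assumes a hypothetical fixed critic Q(s, a) = -(a - 2s)² and a scalar linear actor μθ(s) = θs; a real DDPG agent would learn the critic from replayed transitions and add exploration noise, both omitted here. All names and constants are illustrative.

```python
import random


def critic_q(state, action):
    """Hypothetical fixed critic Q(s, a) = -(a - 2s)^2; DDPG would learn this."""
    return -(action - 2.0 * state) ** 2


def dq_da(state, action):
    """Gradient of the hypothetical critic with respect to the action."""
    return -2.0 * (action - 2.0 * state)


def actor(theta, state):
    """Deterministic linear policy mu_theta(s) = theta * s."""
    return theta * state


def dpg_step(theta, states, lr):
    """One deterministic policy gradient step on a batch of states.

    grad J = mean_s [ dQ/da(s, mu(s)) * dmu/dtheta(s) ], where dmu/dtheta = s.
    """
    grad = sum(dq_da(s, actor(theta, s)) * s for s in states) / len(states)
    return theta + lr * grad  # gradient *ascent* on Q


def soft_update(target, online, tau):
    """Target tracking: target <- tau * online + (1 - tau) * target."""
    return tau * online + (1.0 - tau) * target


def train(steps=200, lr=0.1, tau=0.05, seed=0):
    rng = random.Random(seed)
    theta, theta_target = 0.0, 0.0
    for _ in range(steps):
        # Stand-in for states sampled from a replay buffer.
        batch = [rng.uniform(-1.0, 1.0) for _ in range(32)]
        theta = dpg_step(theta, batch, lr)
        theta_target = soft_update(theta_target, theta, tau)
    return theta, theta_target
```

Under this critic the best action is a = 2s, so `train()` should drive θ toward 2, with the target parameter lagging slightly behind.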
Papers:
- DDPG
- Continuous control
Blog posts:

Trust Region Policy Optimization (TRPO)

Constrains each policy update so that the average KL divergence between the old and new policies stays below a fixed bound (the trust region). This improves learning stability and reduces sensitivity to hyperparameters.
Applied to ALE and MuJoCo tasks; outperforms DDPG on many MuJoCo continuous control tasks.
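A minimal sketch of a KL-constrained update for a single state with a categorical (softmax) policy, assuming the advantages are already known. Real TRPO computes a natural-gradient step direction with conjugate gradient; this sketch swaps in the plain gradient of the surrogate and keeps only the backtracking line search, which rejects any step whose KL(π_old || π_new) exceeds `max_kl` or that fails to improve the surrogate. All names and constants are illustrative assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions (q has full support)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def surrogate(new_probs, old_probs, advantages):
    """Importance-sampled surrogate: sum_a pi_old(a) * (pi_new(a) / pi_old(a)) * A(a)."""
    return sum(o * (n / o) * a for n, o, a in zip(new_probs, old_probs, advantages))


def surrogate_grad(logits, advantages):
    """Gradient of sum_a pi(a) A(a) w.r.t. softmax logits: pi_i * (A_i - E_pi[A])."""
    probs = softmax(logits)
    mean_adv = sum(p * a for p, a in zip(probs, advantages))
    return [p * (a - mean_adv) for p, a in zip(probs, advantages)]


def kl_constrained_step(logits, advantages, max_kl=0.01, step=10.0,
                        backtrack=0.5, max_iters=20):
    """Backtracking line search that enforces the KL bound (trust region)."""
    old_probs = softmax(logits)
    old_obj = surrogate(old_probs, old_probs, advantages)
    direction = surrogate_grad(logits, advantages)  # plain gradient, not natural gradient
    for _ in range(max_iters):
        candidate = [l + step * d for l, d in zip(logits, direction)]
        new_probs = softmax(candidate)
        kl = kl_categorical(old_probs, new_probs)
        if kl <= max_kl and surrogate(new_probs, old_probs, advantages) > old_obj:
            return candidate, kl
        step *= backtrack  # step too large or not improving: shrink it
    return logits, 0.0  # no acceptable step found: keep the old policy
```

With uniform initial logits over three actions and advantages `[1, 0, -1]`, the first few step sizes overshoot the KL bound, and the search settles on a smaller step that shifts probability toward the first action.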
Papers:
- TRPO
Blog posts:
- Kv frans
Other:
- openai/requests-for-research#22 (comment)

High-dimensional continuous control using generalized advantage estimation

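Introduces generalized advantage estimation (GAE), an exponentially weighted average of k-step advantage estimators; the discount γ and the weighting parameter λ control the bias-variance trade-off of the resulting policy gradient estimate.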
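GAE can be sketched as Â_t = Σ_l (γλ)^l δ_{t+l}, with TD residuals δ_t = r_t + γV(s_{t+1}) - V(s_t), computed backwards over a trajectory. Setting λ = 0 gives the one-step TD residual (lower variance, but biased unless V is exact); λ = 1 gives the discounted return minus V(s_t) (higher variance, but not biased by errors in V). This sketch assumes a single trajectory with no mid-episode terminations and a `values` list holding one extra bootstrap value for the final state; the function name is illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values[t] is V(s_t); values has len(rewards) + 1 entries, the last one
    being the bootstrap value of the state after the final reward
    (use 0.0 if that state is terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual delta_t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursive form of sum_l (gamma * lam)^l * delta_{t+l}.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

For instance, with rewards `[1.0, 0.0]`, values `[0.5, 0.5, 0.0]` and γ = 0.5, λ = 0 returns the raw TD residuals `[0.75, -0.5]`, while λ = 1 returns the Monte Carlo advantages `[0.5, -0.5]`.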

Asynchronous Advantage Actor-Critic (A3C)

ACER

Papers:
- ACER