New Features
added Twin Delayed DDPG (TD3) algorithm, with HER support (a usage sketch follows this list)
added support for continuous action spaces to action_probability, computing the PDF of a Gaussian policy in addition to the existing support for categorical stochastic policies
added a flag to action_probability to return log-probabilities (see the action_probability sketch after this list)
added support for Python lists and NumPy arrays in logger.writekvs (see the logger sketch after this list). (@dwiel)
the info dict returned by VecEnvs now includes a terminal_observation key providing access to the last observation in a trajectory (see the VecEnv sketch below). (@qxcv)
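A minimal TD3 training sketch, assuming the usual Stable Baselines API; the environment and timestep budget below are placeholders:

```python
import gym

from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy

# Placeholder continuous-control task; any Box action-space env works.
env = gym.make("Pendulum-v0")

model = TD3(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
action, _states = model.predict(obs)
# HER support: on a goal-based (GoalEnv) task, TD3 can be passed as the
# model class to the HER wrapper instead of being used directly.
```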
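An action_probability sketch for continuous actions; PPO2 and Pendulum-v0 are placeholder choices, and the name of the log-probability flag (logp below) is an assumption, so check the API documentation:

```python
import gym
import numpy as np

from stable_baselines import PPO2

env = gym.make("Pendulum-v0")
model = PPO2("MlpPolicy", env)

obs = env.reset()
# Candidate action whose likelihood we query under the current Gaussian policy.
action = np.array([env.action_space.sample()])

# PDF of the Gaussian policy evaluated at the given action.
prob = model.action_probability(obs, actions=action)
# Same query, returning log-probabilities (flag name assumed to be ``logp``).
log_prob = model.action_probability(obs, actions=action, logp=True)
```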
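A small logger sketch: logger.logkv and logger.dumpkvs are the standard user-facing calls that feed writekvs, and how the list/array values are rendered is up to the logger backend:

```python
import numpy as np

from stable_baselines import logger

logger.configure()  # default backend: stdout
# Python lists and NumPy arrays can now be passed as values.
logger.logkv("episode_rewards", [1.0, 2.5, 3.0])
logger.logkv("q_values", np.array([0.1, 0.2]))
logger.dumpkvs()
```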
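And a VecEnv sketch for reading terminal_observation (CartPole-v1 is a placeholder env): because VecEnvs reset automatically at the end of an episode, this key is how you recover the true final observation.

```python
import gym

from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make("CartPole-v1")])
obs = env.reset()

for _ in range(1000):
    obs, rewards, dones, infos = env.step([env.action_space.sample()])
    if dones[0]:
        # The VecEnv has already reset; the true last observation of the
        # finished episode is stored in the info dict.
        last_obs = infos[0]["terminal_observation"]
        break
```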
Bug Fixes
fixed a bug in traj_segment_generator where episode_starts was wrongly recorded, resulting in a wrong computation of Generalized Advantage Estimation (GAE); this affects TRPO, PPO1 and GAIL (thanks to @miguelrass for spotting the bug)
added missing property n_batch in BasePolicy.
Others
renamed some keys in traj_segment_generator to be more meaningful
retrieve the unnormalized reward when using the Monitor wrapper with TRPO, PPO1 and GAIL, to display it in the logs (mean episode reward)
cleaned up DDPG code (renamed variables)
Documentation
fixed the documentation of the hyperparameter tuning command in the RL Zoo
added an example of how to log additional values with TensorBoard using a callback (see the sketch below)
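A sketch of that pattern, assuming a TensorFlow 1.x setup and that the training loop exposes ``writer`` and ``self`` in the callback locals; the algorithm, tag and logged value are placeholders:

```python
import gym
import numpy as np
import tensorflow as tf

from stable_baselines import SAC

def callback(locals_, globals_):
    self_ = locals_["self"]
    # Placeholder scalar; replace with any value you want to track.
    value = np.random.random()
    summary = tf.Summary(value=[tf.Summary.Value(tag="custom/random_value", simple_value=value)])
    locals_["writer"].add_summary(summary, self_.num_timesteps)
    return True  # returning False would stop training

model = SAC("MlpPolicy", gym.make("Pendulum-v0"), tensorboard_log="/tmp/sac_tb/", verbose=1)
model.learn(total_timesteps=10000, callback=callback)
```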