- Works because the reward signal is very weak and ES is highly parallelizable
- Doesn't work well when the reward signal is strong: training a simple MNIST classification network with ES is “1000 times slower” than with backpropagation.
- Not the be-all and end-all: better training methods still need the gradient (e.g. predicting the future state of the environment)
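
To make the points above concrete, here is a minimal sketch of a basic ES update, assuming Gaussian perturbations and a mean-reward baseline. The names `evolution_strategy` and `toy_reward`, the toy target, and all hyperparameter values are illustrative assumptions, not taken from any of the papers or posts. The reward function is treated as a black box (no gradient is ever computed), and each perturbed evaluation is independent of the others, which is what makes the method easy to spread across workers.

```python
import random


def evolution_strategy(reward_fn, theta, sigma=0.1, lr=0.05,
                       population=50, steps=200, seed=0):
    """Basic ES sketch: estimate the gradient of the expected reward
    from random perturbations of the parameters (hypothetical helper)."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(steps):
        # Sample one Gaussian noise vector per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Each evaluation only needs theta and its own noise vector,
        # so in a real setup these calls could run on separate workers.
        rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        # Subtract the mean reward as a baseline to reduce variance.
        baseline = sum(rewards) / population
        for i in range(dim):
            # Estimator: (1 / (N * sigma)) * sum_k (R_k - baseline) * eps_k
            grad_i = sum((r - baseline) * eps[i]
                         for r, eps in zip(rewards, noises))
            grad_i /= population * sigma
            theta[i] += lr * grad_i  # gradient ascent on the reward
    return theta


def toy_reward(params):
    """Assumed toy objective: negative squared distance to a fixed target."""
    target = [0.5, -0.3, 0.1]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


start = [0.0, 0.0, 0.0]
solution = evolution_strategy(toy_reward, start)
```

On this smooth toy objective the exact gradient is trivially available, so ES is doing far more work than necessary; that is the same reason it compares poorly with backpropagation on a task like MNIST, where the training signal is strong.
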
- Papers:
- Blog posts: