Reinforcement Learning: An Introduction, 2nd Edition, written by Richard S. Sutton and Andrew G. Barto, is something of a bible of reinforcement learning. It is required reading for students and researchers who want the proper context for the continually developing fields of RL and AI.
Links to buy or rent a hardcover or ebook: MIT Press, Amazon (the paperback version is generally not recommended because of its poor printing quality).
Although the authors have made the book extremely clear and friendly to readers at every level, it can still be intimidating to RL or ML beginners because of its dense concepts, abstract examples and algorithms, and sheer volume. Therefore, as an RL researcher, I am trying to extract the key points and implement the examples and exercises in the book, to help more people better understand the valuable knowledge it generously provides.
My work mainly consists of:
- Turning examples into code and plots that are as close to those in the book as possible;
- Implementing algorithms in Python and testing them with RL playground packages like Gymnasium (a minimal interaction loop is sketched below);
- Taking notes and organizing them as PDF files per chapter.
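For reference, a minimal Gymnasium interaction loop looks roughly like the following; `CartPole-v1` here is just an illustrative environment choice, not one of the book's examples, and the random policy is a placeholder for a learned one.

```python
import gymnasium as gym

# Create an environment and run one episode with random actions.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

episode_return = 0.0
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # replace with a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward

print(f"episode return: {episode_return}")
env.close()
```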
Chapter 2: Multi-armed Bandits 🔗 link
This chapter starts with the bandit problem and introduces strategies like ε-greedy action selection, optimistic initial values, upper-confidence-bound (UCB) action selection, and gradient bandit algorithms.
- A k-armed bandit testbed:
- Parameter study (algorithm comparison) - stationary environment
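To give a flavor of the testbed, below is a minimal sketch of an ε-greedy agent with sample-average value estimates on a stationary k-armed bandit; the arm count, ε, and number of steps are illustrative values, not the book's exact experiment settings.

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, steps = 10, 0.1, 1000
q_true = rng.normal(0, 1, k)        # true action values q*(a)
Q = np.zeros(k)                     # sample-average estimates
N = np.zeros(k)                     # action counts

rewards = []
for t in range(steps):
    # epsilon-greedy action selection
    a = rng.integers(k) if rng.random() < eps else int(np.argmax(Q))
    r = rng.normal(q_true[a], 1.0)  # reward drawn around the true value
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]       # incremental sample-average update
    rewards.append(r)

print(f"average reward over {steps} steps: {np.mean(rewards):.3f}")
```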
Chapter 3: Finite Markov Decision Processes 🔗 link
This chapter introduces the fundamentals of finite Markov decision processes, such as the agent-environment interaction, goals and rewards, returns and episodes, and policies and value functions. It helps build up a basic understanding of the components of reinforcement learning.
- Optimal solution to the gridworld example:
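As a concrete illustration of the return, the discounted return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... can be computed from one episode's reward sequence as in the small sketch below; the reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1, 0, 5]))  # example reward sequence
```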
Chapter 4: Dynamic Programming 🔗 link
The dynamic programming (DP) methods introduced in this chapter include policy iteration, which consists of policy evaluation and policy improvement, and value iteration, which can be considered a concise and efficient version of policy iteration. The chapter also raises the idea that the evaluation and improvement processes compete with each other yet cooperate to find the optimal value function and an optimal policy.
- Jack's Car Rental example
- Gambler's problem
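As a rough sketch of value iteration on a generic finite MDP: the transition-model format `P[s][a]` = list of `(prob, next_state, reward)` tuples below is an assumption made for illustration, not the repository's actual code.

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration on a finite MDP.

    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples.
    """
    def q_value(V, s, a):
        # Expected one-step return of taking action a in state s, then following V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(V, s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once the value function has stabilized
            break

    # Extract a greedy deterministic policy from the converged values.
    policy = [max(range(n_actions), key=lambda a: q_value(V, s, a))
              for s in range(n_states)]
    return V, policy
```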
Chapter 5: Monte Carlo Methods 🔗 link
Monte Carlo methods can be used to learn optimal behavior directly from interaction with the environment, with no model of the environment's dynamics. The chapter introduces on-policy MC methods, such as first-visit Monte Carlo prediction and Monte Carlo control with/without exploring starts, and off-policy MC methods based on ordinary/weighted importance sampling.
- The infinite variance of ordinary importance sampling
- Racetrack
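For instance, first-visit Monte Carlo prediction of v_π can be sketched as below; `generate_episode` is an assumed helper that returns one episode as a list of (state, reward) pairs generated by following the target policy.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, n_episodes, gamma=1.0):
    """Estimate v_pi by averaging returns that follow the first visit to each state.

    generate_episode() is an assumed helper returning one episode as
    [(S_0, R_1), (S_1, R_2), ...] produced by following the target policy.
    """
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode()
        # Record the time step of each state's first visit.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards through the episode accumulating the return G_t.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            g = r + gamma * g
            if first_visit[s] == t:  # only average returns from first visits
                returns_sum[s] += g
                returns_cnt[s] += 1
                V[s] = returns_sum[s] / returns_cnt[s]
    return V
```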
Chapter 6: Temporal-Difference Learning 🔗 link
This chapter introduces temporal-difference (TD) learning and shows how it can be applied to the reinforcement learning problem. The TD control methods are classified according to whether they deal with the exploration complication by using an on-policy (Sarsa, Expected Sarsa) or off-policy (Q-learning) approach. The chapter also discusses using double learning to avoid the maximization bias problem.
- Comparison of TD(0) and MC on Random Walk environment
- Interim and Asymptotic Performance of TD methods
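As a reminder of how compact the tabular rules in this chapter are, here is a sketch of a single Q-learning update; the array layout of Q and the toy call at the end are illustrative only.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Off-policy TD control: Q(S,A) += alpha * (R + gamma * max_a Q(S',a) - Q(S,A))."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Usage sketch on a toy table with 5 states and 2 actions (made-up transition).
Q = np.zeros((5, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=3)
print(Q[0])
```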
Chapter 7: n-step Bootstrapping 🔗 link
In progress