Chapter 6 Temporal-Difference Learning 🔗 Notes
Empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the Random Walk environment. The left graph below shows the values learned after various numbers of episodes on a single run of TD(0); the right graph shows learning curves for the two methods for various values of $\alpha$.
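Below is a minimal sketch of tabular TD(0) prediction on the five-state random walk (non-terminal states A through E, +1 reward only on the right exit). The constants and function names are illustrative assumptions, not the linked code.

```python
import numpy as np

# Five non-terminal states A..E are indices 1..5; 0 and 6 are terminal.
TRUE_VALUES = np.arange(1, 6) / 6.0          # analytic values 1/6 .. 5/6

def td0_episode(values, alpha=0.1):
    """Run one random-walk episode, updating the value estimates in place with TD(0)."""
    state = 3                                 # every episode starts in the center state C
    while state not in (0, 6):
        next_state = state + np.random.choice([-1, 1])
        reward = 1.0 if next_state == 6 else 0.0   # +1 only on exiting to the right
        # TD(0) update with gamma = 1: V(S) <- V(S) + alpha * (R + V(S') - V(S))
        values[state] += alpha * (reward + values[next_state] - values[state])
        state = next_state

values = np.full(7, 0.5)                      # initial estimate of 0.5 everywhere
values[0] = values[6] = 0.0                   # terminal states have value 0
for _ in range(100):
    td0_episode(values)
print("TD(0) estimate:", np.round(values[1:6], 3))
print("true values:   ", np.round(TRUE_VALUES, 3))
```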
Batch-updating versions of TD(0) and constant-$\alpha$ MC were applied to the random walk prediction example: after each new episode, all episodes seen so far are treated as a batch and presented repeatedly until the value function converges.
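A sketch of the batch-training step under the same assumed random-walk setup as above: all stored episodes are replayed, and the value function is moved only by the summed increments, until those increments vanish. The helper names and the fixed batch of 10 episodes are assumptions for illustration; the experiment in the text grows the batch by one episode at a time.

```python
import numpy as np

def random_walk_episode():
    """One random-walk episode as a list of (state, reward, next_state) transitions."""
    transitions, state = [], 3
    while state not in (0, 6):
        next_state = state + np.random.choice([-1, 1])
        reward = 1.0 if next_state == 6 else 0.0
        transitions.append((state, reward, next_state))
        state = next_state
    return transitions

def batch_td0(episodes, alpha=0.001, tol=1e-4):
    """Replay every stored episode repeatedly, moving V only by the summed increments,
    until those increments vanish (the batch-training step described above)."""
    values = np.zeros(7)                      # indices 1..5 non-terminal, 0 and 6 terminal
    while True:
        increments = np.zeros(7)
        for episode in episodes:
            for state, reward, next_state in episode:
                increments[state] += alpha * (reward + values[next_state] - values[state])
        if np.abs(increments).sum() < tol:
            return values
        values += increments

episodes = [random_walk_episode() for _ in range(10)]
print(np.round(batch_td0(episodes)[1:6], 3))
```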
A standard gridworld with start and goal states, but with a crosswind running upward through the middle of the grid. The actions are the standard four (up, down, right, and left), but in the middle region the resultant next states are shifted upward by a “wind,” the strength of which varies from column to column. A minimal Sarsa sketch follows the results below. Code
- Train records:
- Result:
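A minimal $\varepsilon$-greedy Sarsa sketch for the windy gridworld described above, using the layout from the book's example (7×10 grid, wind strengths 0 0 0 1 1 1 2 2 1 0, start at (3, 0), goal at (3, 7), reward −1 per step, undiscounted). The coordinates convention and function names are assumptions, not the linked code.

```python
import numpy as np

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]        # upward push in each column
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Apply the chosen move plus the current column's wind, clipping to the grid."""
    row, col = state
    row = min(max(row + action[0] - WIND[col], 0), ROWS - 1)
    col = min(max(col + action[1], 0), COLS - 1)
    return (row, col)

def epsilon_greedy(q, state, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    values = q[state]
    return int(np.random.choice(np.flatnonzero(values == values.max())))

def sarsa(episodes=200, alpha=0.5, epsilon=0.1):
    """On-policy TD control: bootstrap on the action actually taken in the next state."""
    q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        state, action = START, epsilon_greedy(q, START, epsilon)
        while state != GOAL:
            next_state = step(state, ACTIONS[action])
            next_action = epsilon_greedy(q, next_state, epsilon)
            q[state][action] += alpha * (
                -1.0 + q[next_state][next_action] - q[state][action])
            state, action = next_state, next_action
    return q

q = sarsa()
```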
This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is −1 on all transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward of −100 and sends the agent instantly back to the start. Code
- The reward record of Q-learning and SARSA:
- Result: SARSA
- Result: Q-learning
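For the cliff-walking comparison above, a minimal sketch of the two update rules on the book's 4×12 layout (start at (3, 0), goal at (3, 11), cliff along the bottom row). The episode counts, step sizes, and function names are illustrative assumptions, not the linked code.

```python
import numpy as np

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Move on the grid; stepping into the cliff costs -100 and resets to the start."""
    row = min(max(state[0] + action[0], 0), ROWS - 1)
    col = min(max(state[1] + action[1], 0), COLS - 1)
    if row == 3 and 1 <= col <= 10:           # the cliff runs along the bottom row
        return START, -100.0
    return (row, col), -1.0

def epsilon_greedy(q, state, epsilon=0.1):
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    values = q[state]
    return int(np.random.choice(np.flatnonzero(values == values.max())))

def run(method, episodes=500, alpha=0.5, epsilon=0.1):
    """Return the sum of rewards per episode for 'sarsa' or 'qlearning'."""
    q = np.zeros((ROWS, COLS, len(ACTIONS)))
    returns = []
    for _ in range(episodes):
        state, total = START, 0.0
        action = epsilon_greedy(q, state, epsilon)
        while state != GOAL:
            next_state, reward = step(state, ACTIONS[action])
            next_action = epsilon_greedy(q, next_state, epsilon)
            if method == "sarsa":             # on-policy: value of the action actually taken
                target = reward + q[next_state][next_action]
            else:                             # off-policy Q-learning: value of the greedy action
                target = reward + q[next_state].max()
            q[state][action] += alpha * (target - q[state][action])
            state, action = next_state, next_action
            total += reward
        returns.append(total)
    return np.array(returns)

print("Sarsa      (last 100 episodes):", run("sarsa")[-100:].mean())
print("Q-learning (last 100 episodes):", run("qlearning")[-100:].mean())
```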
Interim and asymptotic performance of TD control methods on the cliff-walking task as a function of $\alpha$.
Comparison of Q-learning and Double Q-learning on a simple episodic MDP. Q-learning initially learns to take the left action much more often than the right action, and always takes it significantly more often than the 5% minimum probability enforced by $\varepsilon$-greedy action selection with $\varepsilon = 0.1$.
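A sketch of Double Q-learning on this MDP: from A, RIGHT terminates with reward 0 and LEFT moves to B with reward 0; every action from B terminates with a reward drawn from a normal distribution with mean −0.1 and variance 1. The number of actions at B and all names here are assumptions for illustration.

```python
import numpy as np

N_B_ACTIONS = 10                  # "many" terminating actions out of state B (assumed count)
EPSILON, ALPHA = 0.1, 0.1

def epsilon_greedy(values):
    if np.random.rand() < EPSILON:
        return np.random.randint(len(values))
    return int(np.random.choice(np.flatnonzero(values == values.max())))

def double_q_learning(episodes=300):
    """Double Q-learning on the two-state MDP described in the lead-in above."""
    qa = [np.zeros(2), np.zeros(2)]           # two tables for state A: index 0 = LEFT, 1 = RIGHT
    qb = [np.zeros(N_B_ACTIONS), np.zeros(N_B_ACTIONS)]
    took_left = []
    for _ in range(episodes):
        a = epsilon_greedy(qa[0] + qa[1])     # behave with respect to the sum of both estimates
        took_left.append(a == 0)
        i = np.random.randint(2)              # table to update; the other supplies the value
        j = 1 - i
        if a == 1:                            # RIGHT: terminal, reward 0
            qa[i][a] += ALPHA * (0.0 - qa[i][a])
        else:                                 # LEFT: reward 0, continue from B
            greedy_b = int(np.argmax(qb[i]))  # argmax from one table, value from the other
            qa[i][a] += ALPHA * (qb[j][greedy_b] - qa[i][a])
            b = epsilon_greedy(qb[0] + qb[1])
            reward = np.random.normal(-0.1, 1.0)
            k = np.random.randint(2)
            qb[k][b] += ALPHA * (reward - qb[k][b])
    return np.array(took_left)

print("fraction LEFT over first 100 episodes:", double_q_learning()[:100].mean())
```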
Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. You can also include a ninth action that causes no movement at all, other than that caused by the wind. A sketch of the modified action set and transition function follows the results below. Code
- Train records:
- Result:
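A sketch of what changes for the King's-moves variant, assuming the windy-gridworld constants and Sarsa loop from the earlier sketch are reused; only the action set and the transition function differ. Names are illustrative.

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

# King's moves: the eight compass directions; optionally add a ninth "stay" action,
# so that any movement then comes only from the wind.
KING_ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
NINE_ACTIONS = KING_ACTIONS + [(0, 0)]

def step_kings(state, action):
    """Chosen move plus the current column's upward wind, clipped to the grid."""
    row, col = state
    row = min(max(row + action[0] - WIND[col], 0), ROWS - 1)
    col = min(max(col + action[1], 0), COLS - 1)
    return (row, col)
```

Substituting KING_ACTIONS (or NINE_ACTIONS) and step_kings for ACTIONS and step in the Sarsa loop above is the only change this exercise requires.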
Re-solve the windy gridworld with King's moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. A sketch of the stochastic-wind transition follows the results below. Code
- Train records:
- Result:
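A sketch of a stochastic-wind transition under the same assumed layout: in columns with nonzero wind, the upward push is wind − 1, wind, or wind + 1 with equal probability, while columns with zero wind stay deterministic. Everything around the randomization is an illustrative assumption.

```python
import numpy as np

ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]

def step_stochastic(state, action):
    """King's-moves transition where a nonzero wind pushes up by wind-1, wind, or wind+1
    with equal probability; columns with zero wind behave as in the previous exercise."""
    row, col = state
    wind = WIND[col]
    if wind > 0:
        wind += np.random.choice([-1, 0, 1])
    row = min(max(row + action[0] - wind, 0), ROWS - 1)
    col = min(max(col + action[1], 0), COLS - 1)
    return (row, col)
```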