
Chapter 6 Temporal-Difference Learning 🔗 Notes

Examples

6.2 Random Walk (p.125)

Empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the Random Walk environment. The left graph below shows the values learned after various numbers of episodes on a single run of TD(0); the right graph shows learning curves for the two methods for various values of $\alpha$. Code
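As a rough illustration of the two update rules (not the repository's code), here is a minimal sketch of tabular TD(0) and constant-$\alpha$ MC prediction on the 5-state random walk. The state indexing, the initial value of 0.5, and the function names are assumptions made for the sketch.

```python
import numpy as np

# Minimal sketch of tabular TD(0) and constant-alpha MC prediction on the
# 5-state random walk: states 1..5 = A..E, 0 and 6 terminal, reward +1 only
# when exiting on the right, gamma = 1. Names are illustrative.
N_STATES = 5
START = 3                                   # state C
TRUE_VALUES = np.arange(1, N_STATES + 1) / (N_STATES + 1)   # 1/6 ... 5/6

def td0_episode(V, alpha=0.1):
    """One episode of TD(0); V is updated in place after every step."""
    s = START
    while 0 < s < N_STATES + 1:
        s_next = s + np.random.choice([-1, 1])
        reward = 1.0 if s_next == N_STATES + 1 else 0.0
        v_next = 0.0 if s_next in (0, N_STATES + 1) else V[s_next]
        # TD(0): V(s) <- V(s) + alpha * (r + V(s') - V(s))
        V[s] += alpha * (reward + v_next - V[s])
        s = s_next

def mc_episode(V, alpha=0.1):
    """One episode of constant-alpha MC; V is updated only at episode end."""
    s, visited = START, []
    while 0 < s < N_STATES + 1:
        visited.append(s)
        s = s + np.random.choice([-1, 1])
    G = 1.0 if s == N_STATES + 1 else 0.0   # undiscounted return of the episode
    for state in visited:
        # constant-alpha MC: V(s) <- V(s) + alpha * (G - V(s))
        V[state] += alpha * (G - V[state])

V = np.full(N_STATES + 2, 0.5)              # non-terminal values start at 0.5
V[0] = V[N_STATES + 1] = 0.0
for _ in range(100):
    td0_episode(V)
print("TD(0) estimates:", V[1:-1], "true values:", TRUE_VALUES)
```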

6.3 Random Walk under batch updating (p.126)

Batch-updating versions of TD(0) and constant-$\alpha$ MC were applied as follows to the random walk prediction example (Example 6.2). After each new episode, all episodes seen so far were treated as a batch. They were repeatedly presented to the algorithm, either TD(0) or constant-$\alpha$ MC, with $\alpha$ sufficiently small that the value function converged. Code
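The batch-updating scheme can be sketched as follows; the episode format, the tolerance, and the name `batch_td0` are illustrative assumptions, not the repository's interface. The constant-$\alpha$ MC variant is the same loop with the return $G$ in place of $r + V(s')$.

```python
import numpy as np

# Batch TD(0) on the random walk: after each new episode is appended to
# `episodes`, sweep over all stored transitions repeatedly, summing the TD
# increments, and only then apply them; repeat until V stops changing.
def batch_td0(episodes, n_states=5, alpha=0.001, tol=1e-4):
    """episodes: list of trajectories, each a list of (s, r, s_next) tuples."""
    V = np.full(n_states + 2, 0.5)
    V[0] = V[n_states + 1] = 0.0            # terminal states
    while True:
        delta = np.zeros_like(V)
        for trajectory in episodes:
            for s, r, s_next in trajectory:
                v_next = 0.0 if s_next in (0, n_states + 1) else V[s_next]
                delta[s] += alpha * (r + v_next - V[s])   # accumulate only
        if np.abs(delta).max() < tol:       # converged for this batch
            return V
        V += delta
```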

6.5 Windy Gridworld (p.130)

A standard gridworld with start and goal states, and a crosswind running upward through the middle of the grid. The actions are the standard four — up, down, right, and left — but in the middle region the resultant next states are shifted upward by a “wind,” the strength of which varies from column to column. A minimal environment and Sarsa sketch follows the results below. Code

  • Train records:

  • Result:
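A minimal sketch of the windy-gridworld dynamics and an $\varepsilon$-greedy Sarsa loop, assuming the standard 7×10 grid, wind strengths (0, 0, 0, 1, 1, 1, 2, 2, 1, 0), start (3, 0), goal (3, 7), and a reward of −1 per step; the constants and names (`step`, `sarsa`, ...) are illustrative, not the repository's code.

```python
import numpy as np

# Windy gridworld: the wind of the current column pushes the agent upward.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]          # up, down, right, left

def step(state, a):
    """Apply the chosen action plus the wind of the current column; reward is -1."""
    r, c = state
    dr, dc = ACTIONS[a]
    r = min(max(r + dr - WIND[c], 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1

def epsilon_greedy(Q, s, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(Q[s]))

def sarsa(episodes=200, alpha=0.5, eps=0.1):
    Q = {(r, c): np.zeros(len(ACTIONS)) for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, a = START, epsilon_greedy(Q, START, eps)
        while s != GOAL:                               # Q[GOAL] stays 0 (terminal)
            s2, r = step(s, a)
            a2 = epsilon_greedy(Q, s2, eps)
            # Sarsa: Q(s,a) <- Q(s,a) + alpha * (r + Q(s',a') - Q(s,a)), gamma = 1
            Q[s][a] += alpha * (r + Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q
```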

6.6 Cliff Walking (p.132)

This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is −1 on all transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward of −100 and sends the agent instantly back to the start. A sketch of the two update rules follows the results below. Code

  • The reward record of Q-learning and SARSA:

  • Result: SARSA

  • Result: Q-learning
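The on-policy/off-policy difference comes down to the backup target. A hedged sketch of the two updates on a single transition, assuming `Q` maps states to arrays of action values (names are illustrative):

```python
import numpy as np

# On a transition (s, a, r, s'): Sarsa bootstraps from the action a' actually
# chosen by the epsilon-greedy behavior policy, Q-learning from the greedy max.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    # on-policy target: r + gamma * Q(s', a')
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=1.0):
    # off-policy target: r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
```

Because Q-learning backs up the greedy value, it learns values for the optimal path right along the cliff edge and occasionally falls off due to exploration, while Sarsa takes the $\varepsilon$-greedy exploration into account and settles on the longer but safer path.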

Figure 6.3 Performance of TD methods on Cliff Walking (p.133)

Interim and asymptotic performance of TD control methods on the cliff-walking task as a function of $\alpha$. All algorithms used an $\varepsilon$-greedy policy with $\varepsilon = 0.1$. Asymptotic performance is an average over 100,000 episodes, whereas interim performance is an average over the first 100 episodes. Code
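A simplified sketch of the $\alpha$ sweep behind the figure (the book additionally averages over many independent runs). `run_episode` is a caller-supplied closure that keeps its own Q-table across calls and returns the sum of rewards of one episode; all names here are illustrative assumptions.

```python
import numpy as np

# For each alpha: interim = mean episode return over the first 100 episodes,
# asymptotic = mean over all episodes of a much longer run.
def performance_curve(run_episode, alphas, interim_eps=100, asymptotic_eps=100_000):
    """run_episode(alpha) -> sum of rewards of one episode for the chosen method."""
    interim, asymptotic = [], []
    for alpha in alphas:
        rewards = [run_episode(alpha) for _ in range(asymptotic_eps)]
        interim.append(np.mean(rewards[:interim_eps]))     # first 100 episodes
        asymptotic.append(np.mean(rewards))                # all episodes
    return interim, asymptotic
```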

6.7 Comparison of Q-learning and Double Q-learning on a Simple MDP (p.135)

Comparison of Q-learning and Double Q-learning on a simple episodic MDP. Q-learning initially learns to take the left action much more often than the right action, and always takes it significantly more often than the 5% minimum probability enforced by $\varepsilon$-greedy action selection with $\varepsilon = 0.1$. In contrast, Double Q-learning is essentially unaffected by maximization bias. These data are averaged over 10,000 runs. Code
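A minimal sketch of the tabular Double Q-learning update on one transition, assuming two action-value tables indexed as `Q[s][a]` (names are illustrative); in the example itself, actions are selected $\varepsilon$-greedily with respect to $Q_1 + Q_2$.

```python
import numpy as np

# A coin flip picks which table to update; the greedy action is chosen by the
# updated table but evaluated by the other, which removes the maximization
# bias that plain Q-learning suffers from.
def double_q_update(Q1, Q2, s, a, r, s2, terminal, alpha=0.1, gamma=1.0):
    if np.random.rand() < 0.5:
        Q1, Q2 = Q2, Q1                         # update the other table half the time
    if terminal:
        target = 0.0
    else:
        best = int(np.argmax(Q1[s2]))           # argmax under the table being updated
        target = gamma * Q2[s2][best]           # ... evaluated by the other table
    Q1[s][a] += alpha * (r + target - Q1[s][a])
```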

Exercises

6.9 Windy Gridworld with King's Moves (p.131)

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. A ninth action that causes no movement at all, other than that caused by the wind, can also be included. A sketch of both action sets follows the results below. Code

  • Train records:

  • Result:
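The only change needed relative to Example 6.5 is the action set. A sketch under the assumption that actions are (row offset, column offset) pairs; the names are illustrative.

```python
# King's moves: all eight neighboring cells; the ninth "do nothing" action
# leaves movement entirely to the wind.
KING_MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
NINE_MOVES = KING_MOVES + [(0, 0)]              # optional ninth action
```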

6.10 Windy Gridworld with Stochastic Wind (p.131)

Re-solve the windy gridworld with King's moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. A sketch of the stochastic step follows the results below. Code

  • Train records:

  • Result:
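A sketch of the modified step function, assuming the wind varies by ±1 with probability 1/3 each in columns with nonzero wind, as the exercise specifies; constants and names are illustrative.

```python
import numpy as np

# Same gridworld as before, but the wind's strength is sampled around its
# mean value whenever the current column has any wind at all.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
ROWS, COLS = 7, 10

def stochastic_step(state, action):
    """`action` is a (row offset, column offset) pair, e.g. a king's move."""
    r, c = state
    dr, dc = action
    wind = WIND[c]
    if wind > 0:
        wind += np.random.choice([-1, 0, 1])    # wind varies by 1 from its mean
    r = min(max(r + dr - wind, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1
```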