
Chapter 6 Temporal-Difference Learning 🔗 Notes

Examples

6.2 Random Walk (p.125)

Empirically compare the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the Random Walk environment. The left graph below shows the values learned after various numbers of episodes on a single run of TD(0); the right graph shows learning curves for the two methods for various values of $\alpha$. Code
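As a rough illustration of the two update rules (not the repository's code), here is a minimal sketch of tabular TD(0) and constant-$\alpha$ MC prediction on the 5-state random walk. The state indexing, the initial value of 0.5, and the function names are assumptions made for the sketch.

```python
import numpy as np

# Minimal sketch of tabular TD(0) and constant-alpha MC prediction on the
# 5-state random walk: states 1..5 = A..E, 0 and 6 terminal, reward +1 only
# when exiting on the right, gamma = 1. Names are illustrative.
N_STATES = 5
START = 3                                   # state C
TRUE_VALUES = np.arange(1, N_STATES + 1) / (N_STATES + 1)   # 1/6 ... 5/6

def td0_episode(V, alpha=0.1):
    """One episode of TD(0); V is updated in place after every step."""
    s = START
    while 0 < s < N_STATES + 1:
        s_next = s + np.random.choice([-1, 1])
        reward = 1.0 if s_next == N_STATES + 1 else 0.0
        v_next = 0.0 if s_next in (0, N_STATES + 1) else V[s_next]
        # TD(0): V(s) <- V(s) + alpha * (r + V(s') - V(s))
        V[s] += alpha * (reward + v_next - V[s])
        s = s_next

def mc_episode(V, alpha=0.1):
    """One episode of constant-alpha MC; V is updated only at episode end."""
    s, visited = START, []
    while 0 < s < N_STATES + 1:
        visited.append(s)
        s = s + np.random.choice([-1, 1])
    G = 1.0 if s == N_STATES + 1 else 0.0   # undiscounted return of the episode
    for state in visited:
        # constant-alpha MC: V(s) <- V(s) + alpha * (G - V(s))
        V[state] += alpha * (G - V[state])

V = np.full(N_STATES + 2, 0.5)              # non-terminal values start at 0.5
V[0] = V[N_STATES + 1] = 0.0
for _ in range(100):
    td0_episode(V)
print("TD(0) estimates:", V[1:-1], "true values:", TRUE_VALUES)
```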

6.3 Random Walk under batch updating (p.126)

Batch-updating versions of TD(0) and constant-$\alpha$ MC were applied as follows to the random walk prediction example (Example 6.2). After each new episode, all episodes seen so far were treated as a batch. They were repeatedly presented to the algorithm, either TD(0) or constant-$\alpha$ MC, with $\alpha$ sufficiently small that the value function converged. Code
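The batch-updating scheme can be sketched as follows; the episode format, the tolerance, and the name `batch_td0` are illustrative assumptions, not the repository's interface. The constant-$\alpha$ MC variant is the same loop with the return $G$ in place of $r + V(s')$.

```python
import numpy as np

# Batch TD(0) on the random walk: after each new episode is appended to
# `episodes`, sweep over all stored transitions repeatedly, summing the TD
# increments, and only then apply them; repeat until V stops changing.
def batch_td0(episodes, n_states=5, alpha=0.001, tol=1e-4):
    """episodes: list of trajectories, each a list of (s, r, s_next) tuples."""
    V = np.full(n_states + 2, 0.5)
    V[0] = V[n_states + 1] = 0.0            # terminal states
    while True:
        delta = np.zeros_like(V)
        for trajectory in episodes:
            for s, r, s_next in trajectory:
                v_next = 0.0 if s_next in (0, n_states + 1) else V[s_next]
                delta[s] += alpha * (r + v_next - V[s])   # accumulate only
        if np.abs(delta).max() < tol:       # converged for this batch
            return V
        V += delta
```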

6.5 Windy Gridworld (p.130)

A standard gridworld with start and goal states, and a crosswind running upward through the middle of the grid. The actions are the standard four — up, down, right, and left — but in the middle region the resultant next states are shifted upward by a “wind,” the strength of which varies from column to column. A minimal environment and Sarsa sketch follows the results below. Code

  • Train records:

  • Result:
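A minimal sketch of the windy-gridworld dynamics and an $\varepsilon$-greedy Sarsa loop, assuming the standard 7×10 grid, wind strengths (0, 0, 0, 1, 1, 1, 2, 2, 1, 0), start (3, 0), goal (3, 7), and a reward of −1 per step; the constants and names (`step`, `sarsa`, ...) are illustrative, not the repository's code.

```python
import numpy as np

# Windy gridworld: the wind of the current column pushes the agent upward.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
ROWS, COLS = 7, 10
START, GOAL = (3, 0), (3, 7)
ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]          # up, down, right, left

def step(state, a):
    """Apply the chosen action plus the wind of the current column; reward is -1."""
    r, c = state
    dr, dc = ACTIONS[a]
    r = min(max(r + dr - WIND[c], 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1

def epsilon_greedy(Q, s, eps=0.1):
    if np.random.rand() < eps:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(Q[s]))

def sarsa(episodes=200, alpha=0.5, eps=0.1):
    Q = {(r, c): np.zeros(len(ACTIONS)) for r in range(ROWS) for c in range(COLS)}
    for _ in range(episodes):
        s, a = START, epsilon_greedy(Q, START, eps)
        while s != GOAL:                               # Q[GOAL] stays 0 (terminal)
            s2, r = step(s, a)
            a2 = epsilon_greedy(Q, s2, eps)
            # Sarsa: Q(s,a) <- Q(s,a) + alpha * (r + Q(s',a') - Q(s,a)), gamma = 1
            Q[s][a] += alpha * (r + Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q
```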

6.6 Cliff Walking (p.132)

This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is −1 on all transitions except those into the region marked “The Cliff.” Stepping into this region incurs a reward of −100 and sends the agent instantly back to the start. A sketch of the two update rules follows the results below. Code

  • The reward record of Q-learning and SARSA:

  • Result: SARSA

  • Result: Q-learning
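The on-policy/off-policy difference comes down to the backup target. A hedged sketch of the two updates on a single transition, assuming `Q` maps states to arrays of action values (names are illustrative):

```python
import numpy as np

# On a transition (s, a, r, s'): Sarsa bootstraps from the action a' actually
# chosen by the epsilon-greedy behavior policy, Q-learning from the greedy max.
def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=1.0):
    # on-policy target: r + gamma * Q(s', a')
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.5, gamma=1.0):
    # off-policy target: r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s][a])
```

Because Q-learning backs up the greedy value, it learns values for the optimal path right along the cliff edge and occasionally falls off due to exploration, while Sarsa takes the $\varepsilon$-greedy exploration into account and settles on the longer but safer path.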

Figure 6.3 Performance of TD methods on Cliff Walking (p.133)

Interim and asymptotic performance of TD control methods on the cliff-walking task as a function of $\alpha$. All algorithms used an $\varepsilon$-greedy policy with $\varepsilon = 0.1$. Asymptotic performance is an average over 100,000 episodes, whereas interim performance is an average over the first 100 episodes. Code
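A simplified sketch of the $\alpha$ sweep behind the figure (the book additionally averages over many independent runs). `run_episode` is a caller-supplied closure that keeps its own Q-table across calls and returns the sum of rewards of one episode; all names here are illustrative assumptions.

```python
import numpy as np

# For each alpha: interim = mean episode return over the first 100 episodes,
# asymptotic = mean over all episodes of a much longer run.
def performance_curve(run_episode, alphas, interim_eps=100, asymptotic_eps=100_000):
    """run_episode(alpha) -> sum of rewards of one episode for the chosen method."""
    interim, asymptotic = [], []
    for alpha in alphas:
        rewards = [run_episode(alpha) for _ in range(asymptotic_eps)]
        interim.append(np.mean(rewards[:interim_eps]))     # first 100 episodes
        asymptotic.append(np.mean(rewards))                # all episodes
    return interim, asymptotic
```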

6.7 Comparison of Q-learning and Double Q-learning on a Simple MDP (p.135)

Comparison of Q-learning and Double Q-learning on a simple episodic MDP. Q-learning initially learns to take the left action much more often than the right action, and always takes it significantly more often than the 5% minimum probability enforced by $\varepsilon$-greedy action selection with $\varepsilon = 0.1$. In contrast, Double Q-learning is essentially unaffected by maximization bias. These data are averaged over 10,000 runs. Code
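A minimal sketch of the tabular Double Q-learning update on one transition, assuming two action-value tables indexed as `Q[s][a]` (names are illustrative); in the example itself, actions are selected $\varepsilon$-greedily with respect to $Q_1 + Q_2$.

```python
import numpy as np

# A coin flip picks which table to update; the greedy action is chosen by the
# updated table but evaluated by the other, which removes the maximization
# bias that plain Q-learning suffers from.
def double_q_update(Q1, Q2, s, a, r, s2, terminal, alpha=0.1, gamma=1.0):
    if np.random.rand() < 0.5:
        Q1, Q2 = Q2, Q1                         # update the other table half the time
    if terminal:
        target = 0.0
    else:
        best = int(np.argmax(Q1[s2]))           # argmax under the table being updated
        target = gamma * Q2[s2][best]           # ... evaluated by the other table
    Q1[s][a] += alpha * (r + target - Q1[s][a])
```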

Exercises

6.9 Windy Gridworld with King's Moves (p.131)

Re-solve the windy gridworld assuming eight possible actions, including the diagonal moves, rather than four. A ninth action that causes no movement at all, other than that caused by the wind, can also be included. A sketch of both action sets follows the results below. Code

  • Train records:

  • Result:
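The only change needed relative to Example 6.5 is the action set. A sketch under the assumption that actions are (row offset, column offset) pairs; the names are illustrative.

```python
# King's moves: all eight neighboring cells; the ninth "do nothing" action
# leaves movement entirely to the wind.
KING_MOVES = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]
NINE_MOVES = KING_MOVES + [(0, 0)]              # optional ninth action
```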

6.10 Windy Gridworld with Stochastic Wind (p.131)

Re-solve the windy gridworld with King's moves, assuming that the effect of the wind, if there is any, is stochastic, sometimes varying by 1 from the mean values given for each column. A sketch of the stochastic step follows the results below. Code

  • Train records:

  • Result:
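A sketch of the modified step function, assuming the wind varies by ±1 with probability 1/3 each in columns with nonzero wind, as the exercise specifies; constants and names are illustrative.

```python
import numpy as np

# Same gridworld as before, but the wind's strength is sampled around its
# mean value whenever the current column has any wind at all.
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
ROWS, COLS = 7, 10

def stochastic_step(state, action):
    """`action` is a (row offset, column offset) pair, e.g. a king's move."""
    r, c = state
    dr, dc = action
    wind = WIND[c]
    if wind > 0:
        wind += np.random.choice([-1, 0, 1])    # wind varies by 1 from its mean
    r = min(max(r + dr - wind, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    return (r, c), -1
```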