Dissecting “Reinforcement Learning” by Richard S. Sutton with custom Python implementations, Episode V
In our previous post, we wrapped up the introductory series on fundamental reinforcement learning (RL) techniques by exploring Temporal-Difference (TD) learning. TD methods merge the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, leveraging their best features to form some of the most important RL algorithms, such as Q-learning.
Building on that foundation, this post delves into n-step TD learning, a versatile approach introduced in Chapter 7 of Sutton’s book [1]. This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they also incorporate the next n
rewards, offering a unique blend of short-term and long-term learning. In a future post, we’ll generalize this concept even further with eligibility traces.
We’ll follow a structured approach, starting with the prediction problem before moving to control. Along the way, we’ll:
- Introduce n-step Sarsa,
- Extend it to off-policy learning,
- Explore the n-step tree backup algorithm, and
- Present a unifying perspective with n-step Q(σ).
As always, you can find all accompanying code on GitHub. Let’s dive in!