Unifying MC Methods and TD Methods¶
Bootstrapping, TD($\lambda$) and Eligibility Traces¶
Reference: Chapter 7 and Chapter 12, Sutton and Barto¶
Bootstrapping and Sampling¶
Why do Bootstrapping?¶
- Frees us from the tyranny of the time step
- Sometimes an update is required at every single transition (one-step TD)
  - Takes into account every transition, i.e. anything that has changed
- Sometimes it makes sense to update only every few transitions (multi-step TD)
  - Takes into account only the significant changes
n-Step Prediction¶
n-Step Return¶
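For reference, the standard n-step return (Sutton and Barto, Chapter 7): the first n rewards are sampled from real experience and the remainder is bootstrapped from the current value estimate,

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}).$$

With $n = 1$ this is the one-step TD target; for $n \ge T - t$ it becomes the full Monte Carlo return $G_t$.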
n-Step Prediction¶
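n-step TD prediction then uses this return as the target for the usual incremental update:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$$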
n-Step SARSA (On-Policy Control)¶
- Simply switch states for actions (state–action pairs) and then use an $\epsilon$-greedy policy.
- The n-step return is expressed in terms of estimated action values.
- The update to a particular state–action value uses this return as the target (both are given below).
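For reference, the standard n-step SARSA quantities (Sutton and Barto, Chapter 7), with $Q$ the current action-value estimate:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q(S_{t+n}, A_{t+n})$$

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t^{(n)} - Q(S_t, A_t) \right)$$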
n-Step SARSA (On-Policy Control)¶
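A minimal Python sketch of tabular n-step SARSA, assuming integer-indexed states and actions and an environment with `reset()`/`step(action)` returning `(state, reward, done)`; the names (`env`, `n_states`, `n_actions`, `epsilon`, ...) are illustrative, not from the original notes.

```python
import numpy as np

def n_step_sarsa(env, n_states, n_actions, n=4, alpha=0.1, gamma=0.99,
                 epsilon=0.1, episodes=500):
    """Tabular n-step SARSA; returns the learned Q-table."""
    Q = np.zeros((n_states, n_actions))

    def policy(s):
        # epsilon-greedy with respect to the current Q estimate
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        states = [env.reset()]          # S_0 (assumed integer state index)
        actions = [policy(states[0])]   # A_0
        rewards = [0.0]                 # dummy entry so rewards[k] == R_k
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])   # assumed interface
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(policy(s))
            tau = t - n + 1             # time whose estimate is updated
            if tau >= 0:
                # n-step return: sampled rewards plus a bootstrapped tail
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, int(min(tau + n, T)) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```

Setting `n=1` recovers one-step SARSA; larger `n` trades increased variance in the target for reduced bootstrapping bias.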
n-Step SARSA Example¶
n-Step Off-Policy Control (with Importance Sampling)¶
- Learning the value function for one policy, π, while following another policy, μ.
- Often, π is the greedy policy for the current action-value estimate, and μ is a more exploratory policy, perhaps ε-greedy.
- We must account for the difference between the two policies, using the relative probability of taking the actions that were actually taken.
- To measure this difference, we use the importance sampling ratio.
- The only difference from the Monte Carlo case is that instead of measuring the ratio over the entire episode, we measure it over the n steps, as shown below.
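For the prediction case, the n-step importance sampling ratio and the corresponding off-policy update are (Sutton and Barto, Chapter 7):

$$\rho_{t:t+n-1} = \prod_{k=t}^{\min(t+n-1,\,T-1)} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}$$

$$V(S_t) \leftarrow V(S_t) + \alpha\, \rho_{t:t+n-1} \left( G_t^{(n)} - V(S_t) \right)$$

(For action values, the ratio starts one step later, at $k = t+1$, since the first action $A_t$ is the one being evaluated.)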
Off-Policy Control without Importance Sampling: the Tree Backup Algorithm¶
- This backup is an alternating mix of sample transitions (from each action to the subsequent state) and full backups (from each state we consider all possible actions, their probabilities of occurring under π, and their action values).
- Returns and updates are calculated as follows:
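For reference, the n-step tree-backup return can be written recursively (Sutton and Barto, Chapter 7): the sampled action's value is replaced by the continuation of the return, while all other actions contribute their current estimates weighted by π,

$$G_t^{(1)} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$$

$$G_t^{(n)} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) + \gamma\, \pi(A_{t+1} \mid S_{t+1})\, G_{t+1}^{(n-1)}$$

with the usual update $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( G_t^{(n)} - Q(S_t, A_t) \right)$. No importance sampling ratio is needed, because actions other than the one taken are never sampled, only weighted by their probability under π.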
Unified view of Reinforcement Learning¶
Averaging n-Step Returns¶
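For example, the 2-step and 4-step returns can be averaged with equal weight:

$$\tfrac{1}{2}\, G_t^{(2)} + \tfrac{1}{2}\, G_t^{(4)}$$

Any combination of n-step returns whose weights are non-negative and sum to 1 is a valid target and combines information from different time scales.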
$\lambda$ Returns¶
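The $\lambda$-return combines all n-step returns with geometrically decaying weights $(1-\lambda)\lambda^{n-1}$ (Sutton and Barto, Chapter 12):

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

For an episode terminating at time $T$, all n-step returns with $n \ge T - t$ equal the full return $G_t$, so equivalently

$$G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t.$$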
TD(λ) Weighting Function¶
Forward-view TD(λ)¶
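The forward-view update simply uses the $\lambda$-return as the target:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\lambda} - V(S_t) \right)$$

Like Monte Carlo, this can only be computed once the episode has finished, since $G_t^{\lambda}$ looks forward to future rewards.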
Backward-view TD(λ)¶
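The backward view achieves the same effect incrementally, using an eligibility trace per state and the one-step TD error (accumulating-trace form):

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda\, E_{t-1}(s) + \mathbf{1}(S_t = s)$$

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

$$V(s) \leftarrow V(s) + \alpha\, \delta_t\, E_t(s) \quad \text{for all } s$$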
Backward View TD(λ)¶
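A minimal Python sketch of tabular backward-view TD($\lambda$) prediction with accumulating traces, assuming each trajectory is a list of `(state, reward, next_state, done)` tuples with integer state indices; the names are illustrative, not from the original notes.

```python
import numpy as np

def td_lambda_backward(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular backward-view TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    for trajectory in episodes:
        E = np.zeros(n_states)          # eligibility traces, reset per episode
        for s, r, s_next, done in trajectory:
            # one-step TD error (bootstrap only if the episode continues)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            # accumulating trace: decay all traces, bump the visited state
            E *= gamma * lam
            E[s] += 1.0
            # every state is updated in proportion to its trace
            V += alpha * delta * E
    return V
```

Traces are reset at the start of each episode; each update touches every state in proportion to how recently and how frequently it was visited.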
TD(λ) and TD(0)¶
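With $\lambda = 0$ the trace collapses to $E_t(s) = \mathbf{1}(S_t = s)$, so only the current state is updated, and by exactly $\alpha\, \delta_t$: this is precisely the one-step TD(0) update.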
TD(λ) and MC¶
MC and TD(1)¶
Telescoping in TD(1)¶
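With $\lambda = 1$ and the value function held fixed over the episode (offline updates), the discounted sum of TD errors telescopes into the Monte Carlo error:

$$\sum_{k=t}^{T-1} \gamma^{k-t}\, \delta_k = \sum_{k=t}^{T-1} \gamma^{k-t} \left( R_{k+1} + \gamma V(S_{k+1}) - V(S_k) \right) = G_t - V(S_t),$$

using $V(S_T) = 0$ at the terminal state. The total TD(1) update over an episode therefore equals the Monte Carlo update.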
TD(λ) and TD(1)¶
Telescoping in TD(λ)¶
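The same telescoping argument for general $\lambda$ (again with $V$ held fixed during the episode) gives

$$G_t^{\lambda} - V(S_t) = \sum_{k=t}^{T-1} (\gamma \lambda)^{k-t}\, \delta_k,$$

which is exactly the weight the eligibility trace $E_t(s)$ assigns to each TD error in the backward view.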
Forward and Backward TD(λ)¶
Offline Equivalence of Forward and Backward TD¶
Online Equivalence of Forward and Backward TD¶
Summary of Forward and Backward TD(λ)¶