Roughly speaking, Monte Carlo methods wait until the return following the visit is known, then use that return as a target for $V(S_t)$.
DP methods, although they update at every iteration, require a model of the environment for the update: they use the transition probabilities $p(s',r|s,a)$.
TD methods need to wait only until the next time step.
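For comparison, the state-value updates can be written explicitly:
Monte Carlo: $V(S_t) \leftarrow V(S_t) + \alpha \big[ G_t − V(S_t) \big]$, where $G_t$ is the actual return following time $t$ (known only at the end of the episode).
TD(0): $V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) − V(S_t) \big]$, which needs only the next reward and the current estimate of $V(S_{t+1})$.
DP (expected update): $V(S_t) \leftarrow \sum_a \pi(a|S_t) \sum_{s',r} p(s',r|S_t,a) \big[ r + \gamma V(s') \big]$, which requires the model $p(s',r|s,a)$.
The complete tabular TD(0) algorithm for estimating $v_\pi$ follows.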
Input: the policy $\pi$ to be evaluated
Initialize $V(s)$ arbitrarily (e.g., $V(s) = 0, \forall s \in S_+$)
Repeat (for each episode):
$\quad$Initialize $S$
$\quad$Repeat (for each step of episode):
$\quad$$\quad$ $A =$ action given by $\pi$ for $S$
$\quad$$\quad$Take action $A$, observe $R, S'$
$\quad$$\quad V (S) = V (S) + \alpha\big[R + \gamma V (S') − V (S)\big]$
$\quad$$\quad$$S = S'$
$\quad$until $S$ is terminal
Some important points
Suppose you have only a fixed number of episodes of experience from an unknown Markov reward process with two states, A and B, and that you present this batch repeatedly to the learning method until its value estimates converge.
The following episodes are observed:
1) A, 0, B, 0
2) B, 1
3) B, 1
4) B, 1
5) B, 1
6) B, 1
7) B, 1
8) B, 0
Given this batch of data, what would you say are the optimal predictions, the best values for the estimates $V (A)$ and $V (B)$?
Now, which estimate does batch TD(0) make, and which one does constant-$\alpha$ MC make?
Everybody agrees that $V(B) = 3/4$: B was visited eight times, and six of those visits were immediately followed by a reward of 1 and termination. For $V(A)$ there are two reasonable answers. If the process is seen as an instance of a Markov process in which A always transitions to B with reward 0, and B terminates with reward 1 with probability $3/4$ and reward 0 with probability $1/4$, then $V(A) = 0 + V(B) = 3/4$. The other answer is simply the average of the returns observed after visiting A, which gives $V(A) = 0$.
The first answer is the right one, and it is the estimate that batch TD(0) makes; the second is the estimate that batch constant-$\alpha$ MC makes.
If the process is Markov, we expect that the first answer will produce lower error on future data, even though the Monte Carlo answer is better on the existing data.
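To make the two batch answers concrete, here is a small Python sketch that computes them from the eight episodes; the list-of-(state, reward)-pairs encoding and the shortcut used for the batch TD(0) value of A are assumptions made for illustration.

```python
# The eight observed episodes, written as alternating (state, reward) pairs.
episodes = [
    [("A", 0), ("B", 0)],
    [("B", 1)], [("B", 1)], [("B", 1)], [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 0)],
]

# Batch Monte Carlo: V(s) is the average of the returns observed after visiting s.
returns = {"A": [], "B": []}
for ep in episodes:
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(r for _, r in ep[i:]))  # undiscounted return from s
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Batch TD(0) / certainty-equivalence estimate: V(B) is the same, but
# V(A) = R + V(B), because A always transitioned to B with reward 0 in the data.
V_td = dict(V_mc)
V_td["A"] = 0 + V_td["B"]

print(V_mc)  # {'A': 0.0, 'B': 0.75}
print(V_td)  # {'A': 0.75, 'B': 0.75}
```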
First, we estimate $q_\pi(s,a)$ instead of $v_\pi(s)$. The TD(0)-style update for action values is
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) \big]$
(the update is done after every transition from a nonterminal state $S_t$; if $S_{t+1}$ is terminal, $Q(S_{t+1}, A_{t+1})$ is defined as zero).
This rule uses every element of the quintuple of events, $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, that make up a transition from one state–action pair to the next.
Hence this algorithm is named Sarsa.
Initialize $Q(s, a), \forall s \in S, a \in A(s)$, arbitrarily, and $Q($terminal-state$, ·) = 0$
Repeat (for each episode):
$\quad$Initialize $S$
$\quad$Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
$\quad$Repeat (for each step of episode):
$\quad \quad$ Take action $A$, observe $R, S'$
$\quad \quad$ Choose $A'$ from $S'$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
$\quad \quad$ $Q(S, A) = Q(S, A) + \alpha \big[ R + \gamma Q(S', A') − Q(S, A) \big] $
$\quad \quad$ $S = S'; A = A';$
$\quad$ until $S$ is terminal
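A minimal tabular Sarsa sketch in Python, mirroring the box above; the env.reset()/env.step() interface, the explicit actions list, and the epsilon_greedy helper are illustrative assumptions.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, else a greedy one w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (Sarsa). Assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0; terminal pairs stay 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # Target uses the action actually selected in S' (on-policy).
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action  # S = S'; A = A'
    return Q
```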
In Q-learning, as in value iteration, the update target uses $\max_a Q(S', a)$, the maximum over all possible actions in the next state, instead of the value of the action $A'$ actually taken.
The learned action-value function, $Q$, directly approximates $q_*$, the optimal action-value function, independent of the policy being followed.
Initialize $Q(s, a), \forall s \in S, a \in A(s)$, arbitrarily, and $Q($terminal-state$, ·) = 0$
Repeat (for each episode):
$\quad$Initialize $S$
$\quad$Repeat (for each step of episode):
$\quad \quad$ Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)
$\quad \quad$ Take action $A$, observe $R, S'$
$\quad \quad$ $Q(S, A) = Q(S, A) + \alpha \big[ R + \gamma \max_a Q(S', a) − Q(S, A) \big]$
$\quad \quad$ $S = S';$
$\quad$ until $S$ is terminal
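And a matching tabular Q-learning sketch under the same assumed environment interface; note that only the target changes, from the value of the action actually selected to the maximum over actions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (Q-learning). Assumes env.reset() -> state and
    env.step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0; terminal pairs stay 0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy derived from Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Target uses the greedy (max) action value in S', not the action taken.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```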
In the cliff-walking gridworld example, Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the $\epsilon$-greedy action selection.
Sarsa, in contrast, takes the action selection into account and learns the longer but safer path through the upper part of the grid.
Expected Sarsa is a learning algorithm that is just like Q-learning except that, instead of the maximum over next state–action pairs, it uses the expected value, taking into account how likely each action is under the current policy.
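In update form (the expectation is taken under the current policy $\pi$):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) \, Q(S_{t+1}, a) − Q(S_t, A_t) \big]$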
Double Q-learning keeps two independent estimates, $Q_1$ and $Q_2$; on each step one of them is used to select the greedy action in $S'$ and the other to evaluate it, which avoids the maximization bias introduced by Q-learning's $\max$ operator.
Initialize $Q_1(s, a), Q_2(s, a), \forall s \in S, a \in A(s)$, arbitrarily,
Initialize $Q_1($terminal-state$, ·) = Q_2($terminal-state$, ·) = 0$
Repeat (for each episode):
$\quad$Initialize $S$
$\quad$Repeat (for each step of episode):
$\quad \quad$ Choose $A$ from $S$ using policy derived from $Q_1$ and $Q_2$ (e.g., $\epsilon$-greedy in $Q_1+Q_2$)
$\quad \quad$ Take action $A$, observe $R, S'$
$\quad \quad$ With 0.5 probability:
$\quad \quad \quad Q_1(S, A) = Q_1(S, A) + \alpha \big[ R + \gamma Q_2\big(S', \arg\max_a Q_1(S', a)\big) − Q_1(S, A) \big]$
$\quad \quad$ Else:
$\quad \quad \quad Q_2(S, A) = Q_2(S, A) + \alpha \big[ R + \gamma Q_1\big(S', \arg\max_a Q_2(S', a)\big) − Q_2(S, A) \big]$
$\quad \quad$ $S = S';$
$\quad$ until $S$ is terminal
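Finally, a minimal Double Q-learning sketch under the same assumed env.reset()/env.step() interface; on each step a coin flip decides which table is updated, with the other table evaluating the selected greedy action.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Double Q-learning: behave epsilon-greedily w.r.t. Q1 + Q2, and on each step
    update one table using the other table's value of that table's argmax action."""
    Q1 = defaultdict(float)
    Q2 = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy in Q1 + Q2
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # With probability 0.5 update Q1 (evaluated by Q2), else update Q2.
            if random.random() < 0.5:
                Q_sel, Q_eval = Q1, Q2
            else:
                Q_sel, Q_eval = Q2, Q1
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: Q_sel[(next_state, a)])  # argmax under Q_sel
                target = reward + gamma * Q_eval[(next_state, a_star)]
            Q_sel[(state, action)] += alpha * (target - Q_sel[(state, action)])
            state = next_state
    return Q1, Q2
```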