Introduction
Iterative Policy Evaluation Methods
Iterative Control Methods
Average Reward Setting
Reinforcement learning can be used to solve large problems, e.g. games such as Backgammon, whose state spaces are far too large to enumerate explicitly.
How can we use the methods learnt previously on such huge state-spaces?
For example: a tabular method storing only the value function of states (in half precision) for the game of Backgammon would require:
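Assuming Backgammon has on the order of $10^{20}$ states (the commonly quoted figure), two bytes per state gives
### $$ 10^{20} \text{ states} \times 2 \text{ bytes/state} = 2 \times 10^{20} \text{ bytes} \approx 200 \text{ exabytes}$$
which is far beyond any practical memory, so the value function must be approximated rather than stored exactly.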
Generalise from seen states to unseen states;
Update the parameter $w$ using MC or TD learning (see the TD(0) sketch below)
Our ultimate purpose is to use it to find a better policy.
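As a concrete illustration, here is a minimal sketch of the semi-gradient TD(0) update for a linear value function $\hat{v}(s, w) = w^\top x(s)$; the feature map `features` and step size `alpha` are assumed placeholders, not names from these notes.

```python
import numpy as np

def semi_gradient_td0_update(w, features, alpha, gamma, s, r, s_next, done):
    """One semi-gradient TD(0) update for a linear value function v(s, w) = w . x(s)."""
    x = features(s)
    v = w @ x
    v_next = 0.0 if done else w @ features(s_next)
    td_error = r + gamma * v_next - v      # TD target minus current estimate
    w += alpha * td_error * x              # gradient of v(s, w) w.r.t. w is x(s)
    return w
```

For Monte Carlo learning, the TD target $r + \gamma \hat{v}(s', w)$ is simply replaced by the full return $G_t$.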
Generally, if not specified, consider $d(s) = 1, \forall s \in S$
A simple form of generalizing function approximation in which states are grouped together.
One estimated value (one component of the weight vector $\theta$) for each group.
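A minimal sketch of state aggregation as a linear feature map: each state produces a one-hot vector over groups, so each component of $\theta$ is the shared value estimate of one group. The 1000 states and 10 groups below are arbitrary assumptions for illustration.

```python
import numpy as np

def aggregate_features(state, n_states=1000, n_groups=10):
    """One-hot feature vector: all states in the same group share one component of theta."""
    x = np.zeros(n_groups)
    x[state * n_groups // n_states] = 1.0   # map the state index to its group index
    return x

theta = np.zeros(10)                        # one estimated value per group
v_237 = theta @ aggregate_features(237)     # value of state 237 = value of its group
```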
Episodic semi-gradient one-step Sarsa
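A sketch of the algorithm under assumed interfaces: `env.reset()` / `env.step(a)` returning `(state, reward, done)`, a feature map `q_features(s, a)`, and illustrative hyper-parameters; none of these names come from the notes themselves.

```python
import numpy as np

def epsilon_greedy(w, q_features, s, n_actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax([w @ q_features(s, a) for a in range(n_actions)]))

def episodic_semi_gradient_sarsa(env, q_features, n_features, n_actions,
                                 alpha=0.1, gamma=1.0, eps=0.1, n_episodes=100):
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(w, q_features, s, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = q_features(s, a)
            if done:                                   # no bootstrapping from terminal states
                w += alpha * (r - w @ x) * x
            else:
                a_next = epsilon_greedy(w, q_features, s_next, n_actions, eps)
                target = r + gamma * (w @ q_features(s_next, a_next))
                w += alpha * (target - w @ x) * x      # one-step Sarsa update
                s, a = s_next, a_next
    return w
```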
Episodic n-step Semi-gradient Sarsa
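The n-step version differs only in the update target: it accumulates $n$ rewards before bootstrapping from $\hat{q}(S_{t+n}, A_{t+n}, w)$. A sketch of just the target and the weight update, under the same assumed linear $\hat{q}$:

```python
import numpy as np

def n_step_return(rewards, w, q_features, s_n, a_n, gamma, bootstrap):
    """G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n q(S_{t+n}, A_{t+n}, w)."""
    G = sum(gamma**i * r for i, r in enumerate(rewards))   # up to n discounted rewards
    if bootstrap:                                          # omitted if the episode ended early
        G += gamma**len(rewards) * (w @ q_features(s_n, a_n))
    return G

def n_step_sarsa_update(w, q_features, s_t, a_t, G, alpha):
    """Semi-gradient update of the weights toward the n-step return G."""
    x = q_features(s_t, a_t)
    return w + alpha * (G - w @ x) * x
```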
In average reward setting, the quality of a policy $\pi$ is defined as:
The average rate of rewards while following that policy, $\eta (\pi)$
### $$ \eta(\pi) = \lim_{T \to \infty}\frac{1}{T} \sum_{t=1}^{T}{E[R_t|A_{0:t-1} \sim \pi]}$$
### $$ \eta(\pi) = \sum_{s}{d_\pi(s) \sum_{a}{\pi(a|s) \sum_{s',r}{p(s',r|s,a)r}}}$$
$d_\pi$ is the steady-state distribution, which satisfies: ### $$ \sum_{s}{d_\pi(s) \sum_{a}{\pi(a|s,\theta)\, {p(s'|s,a)}}} = d_\pi(s')$$
In the average-reward setting, returns are defined in terms of differences between rewards and the average reward: ### $$G_t = R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + R_{t+3} - \eta(\pi) + \cdots$$
This is known as the differential return.
Differential value functions also have Bellman equations, very similar to those for discounted value functions.
For discounted value functions:
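### $$ v_\pi(s) = \sum_{a}{\pi(a|s) \sum_{s',r}{p(s',r|s,a)\left[r + \gamma v_\pi(s')\right]}}$$
For differential value functions, the discount $\gamma$ is dropped and the average reward $\eta(\pi)$ is subtracted from each reward:
### $$ v_\pi(s) = \sum_{a}{\pi(a|s) \sum_{s',r}{p(s',r|s,a)\left[r - \eta(\pi) + v_\pi(s')\right]}}$$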
There is also a differential form of the two TD errors:
For discounted value functions:
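### $$ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)$$
### $$ \delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \theta) - \hat{q}(S_t, A_t, \theta)$$
and in the differential setting, with $\bar{R}_t$ denoting the estimate at time $t$ of the average reward $\eta(\pi)$:
### $$ \delta_t = R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)$$
### $$ \delta_t = R_{t+1} - \bar{R}_t + \hat{q}(S_{t+1}, A_{t+1}, \theta) - \hat{q}(S_t, A_t, \theta)$$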
Example (an access-control queuing task): In each time step, the customer at the head of the queue is either accepted (assigned to a free server) or rejected.
On the next time step the next customer in the queue is considered.
The queue never empties
A customer cannot be served if there is no free server;
Each busy server becomes free with probability p on each time step.
Task: Decide on each step whether to accept or reject the next customer, on the basis of the customer's priority and the number of free servers, so as to maximise long-term reward.
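A minimal sketch of one differential semi-gradient Sarsa update for this task, with a tabular (one-hot linear) action-value function. The specific numbers of servers and priorities, the convention that the reward for accepting a customer equals its priority (and is 0 for rejection), and the step sizes are all assumptions for illustration; they are not fixed by the notes above.

```python
import numpy as np

# Assumed task constants: 10 servers, customer priorities 1/2/4/8,
# reward = priority of an accepted customer, 0 if rejected.
N_SERVERS, PRIORITIES, N_ACTIONS = 10, [1, 2, 4, 8], 2   # actions: 0 = reject, 1 = accept

# Tabular action values indexed by (free servers, priority index, action).
q = np.zeros((N_SERVERS + 1, len(PRIORITIES), N_ACTIONS))
avg_reward = 0.0                         # running estimate of eta(pi)
alpha, beta = 0.01, 0.01                 # step sizes for q and for avg_reward

def differential_sarsa_update(s, a, r, s_next, a_next):
    """One differential semi-gradient Sarsa update (tabular case)."""
    global avg_reward
    delta = r - avg_reward + q[s_next][a_next] - q[s][a]   # differential TD error
    avg_reward += beta * delta                             # update average-reward estimate
    q[s][a] += alpha * delta                               # update the action value

# Example: 3 free servers, a priority-8 customer is accepted (reward 8); afterwards
# 2 servers are free and the next customer has priority 1.
differential_sarsa_update(s=(3, 3), a=1, r=8, s_next=(2, 0), a_next=1)
```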