1) Introduction
2) Multi-arm Bandit Problems
3) Action Value Methods
4) Improving Exploration in Simple Bandit Problem
Reinforcement Learning: uses information that evaluates the actions taken, rather than instructing by giving the correct actions (as in supervised learning).
Non-Associative setting: Learning to act in just one situation.
Example of Associative setting: Driving, amount of time spent reading in a day.
Example of Non-Associative setting: k-Arm Bandit Problem, Animals learning not to fear stuffed animals.
k-Armed Bandit Problem:
Each action has an expected (mean) reward, given that the action is selected, denoted by $q_* (a)$.
$q_* (a) $ can be written as:
$$ \mathbf{ q_* (a) = E(R_t \mid A_t = a) } $$
If $\mathbf{q_* (a) }$ is known: we will select action 'a' with highest $\mathbf{q_* (a) }$.
But we do not know $\mathbf{q_* (a)}$. What can we do?
We can find a close estimate $\mathbf{Q_t (a) \approx q_* (a) }$.
At any time step: select action with highest $\mathbf{Q_t (a)}$ ==> Exploiting.
At any time step: select action which does not have the highest $\mathbf{Q_t (a)}$ ==> Exploring.
Why explore? Because exploring improves the estimates of the non-greedy actions, which may turn out to be better than the current greedy action in the long run.
This is the k-armed bandit problem, where exploration and exploitation have to be balanced.
Now, we will look at how to solve it.
The simplest way of estimating $Q_t(a)$, given that $Q_t (a) \approx q_* (a) = E(R_t \mid A_t = a)$, is the sample average:
$$ Q_t (a) = \frac{\sum^{t-1}_{i=1}{R_i \cdot 1_{A_i = a}}}{\sum^{t-1}_{i=1}{1_{A_i = a}}} $$
If the denominator is zero, then we instead define $Q_t(a)$ as some default value, such as $Q_1(a) = 0$.
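As an illustration (not part of the original notes), here is a minimal NumPy sketch of this sample-average estimate, assuming we have recorded the history of actions and rewards so far:

```python
import numpy as np

def sample_average_estimates(actions, rewards, k, default=0.0):
    """Sample-average estimate Q_t(a) for each of k actions.

    actions, rewards: histories A_1..A_{t-1} and R_1..R_{t-1}.
    Actions never taken keep the default value (here 0, i.e. Q_1(a) = 0).
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    Q = np.full(k, default, dtype=float)
    for a in range(k):
        taken = actions == a                 # indicator 1_{A_i = a}
        if taken.any():                      # denominator nonzero
            Q[a] = rewards[taken].mean()     # sum of rewards / times taken
    return Q

# Example with k = 3 arms and a short history:
# arm 0 was taken twice (rewards 1.0 and 0.0), arm 1 once (2.0), arm 2 once (0.5)
print(sample_average_estimates([0, 2, 0, 1], [1.0, 0.5, 0.0, 2.0], k=3))
# [0.5 2.  0.5]
```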
Now that we have the estimate $Q_t(a) \forall a \in A$, we need to select an action to take:
The pseudocode for the above algorithm is:
Initialize $Q_t(a) \quad \forall a \in A$,
At each time step:
$\quad$Select an action $A_t = a $ by greedy / $\epsilon$-greedy action selection
$\quad$Update the estimate $Q_t(A_t)$ using the observed reward
In full detail:
Initialize, for a = 1 to k:
$\quad$ Q(a) = 0
$\quad$ N(a) = 0
Repeat Forever:
$\quad$ A = $\operatorname{argmax}_a Q(a) \quad$ with probability $1-\epsilon$
$\quad$ $\quad$ or
$\quad$ A = a random action $\quad$ with probability $\epsilon$
$\quad$ R = reward(A)
$\quad$ N(A) = N(A) +1
$\quad$ Q(A) = Q(A) + $\frac{1}{N(A)}\big[ R - Q(A)\big]$
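A runnable Python version of this pseudocode might look like the sketch below. The `reward_fn` callable and the finite `n_steps` loop (instead of "repeat forever") are assumptions for illustration:

```python
import numpy as np

def simple_bandit(reward_fn, k, epsilon=0.1, n_steps=1000, seed=0):
    """epsilon-greedy action-value method with incremental sample averages.

    reward_fn: callable reward_fn(a) -> float, the bandit environment (assumed).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros(k)                      # value estimates Q(a)
    N = np.zeros(k, dtype=int)           # action counts N(a)
    for _ in range(n_steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))     # explore: random action
        else:
            A = int(np.argmax(Q))        # exploit: greedy action
        R = reward_fn(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]        # incremental sample-average update
    return Q, N
```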
We randomly generated k-armed bandit problems with k = 10.
For each bandit problem, the action values $q_*(a)$, $a = 1, \ldots, 10$, were selected according to a normal (Gaussian) distribution with mean 0 and variance 1.
Then, when a learning method applied to that problem selected action $A_t$ at time step $t$, the actual reward $R_t$ was drawn from a normal distribution with mean $q_*(A_t)$ and variance 1.
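A sketch of one such testbed problem in NumPy (variable names are illustrative); the resulting `reward` function can be plugged into the `simple_bandit` sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
q_star = rng.normal(loc=0.0, scale=1.0, size=k)   # true values q_*(a) ~ N(0, 1)

def reward(a):
    """Reward for selecting action a, drawn from N(q_*(a), 1)."""
    return rng.normal(loc=q_star[a], scale=1.0)

# Example: run the epsilon-greedy learner from the sketch above on this problem.
# Q, N = simple_bandit(reward, k, epsilon=0.1, n_steps=1000)
```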
When would the $\epsilon$-greedy method perform significantly better than the greedy method?
For example, when the rewards are noisier (larger variance), more exploration is needed to find the optimal action; and when the task is nonstationary, exploration remains useful even after the values have been learned.
In the pseudocode above, the sample-average update uses the step size $\frac{1}{N(A)}$, which gives equal weight to every past reward.
Some problems are nonstationary, meaning the reward distribution is not constant over time.
In such cases, it makes sense to weight recent rewards more heavily than long-past ones.
This can be done with a constant step-size parameter $\alpha \in (0, 1]$:
$$ Q_{n+1} = Q_n + \alpha \big[ R_n - Q_n \big] $$
This method is called the exponential recency-weighted average.
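A minimal sketch of this constant step-size update (the function name and the $\alpha = 0.1$ default are illustrative):

```python
def constant_alpha_update(Q, a, R, alpha=0.1):
    """Exponential recency-weighted average.

    Expanding the recursion, the reward R_i from step i carries weight
    alpha * (1 - alpha)**(n - i): older rewards fade geometrically.
    """
    Q[a] += alpha * (R - Q[a])
    return Q
```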
These methods are biased by their initial estimates.
Initial action values can be used as a simple way of encouraging exploration.
However, this boosts exploration only once, at the start, so it is not very useful in nonstationary problems.
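For illustration, a sketch of a purely greedy learner whose only source of exploration is an optimistic initial estimate (the value +5 is an assumed choice, well above any plausible reward in the testbed above):

```python
import numpy as np

def optimistic_greedy(reward_fn, k, q0=5.0, alpha=0.1, n_steps=1000):
    """Greedy action selection with optimistic initial values Q_1(a) = q0.

    Every arm initially looks better than it really is, so the greedy learner
    is 'disappointed' by each observed reward and keeps trying other arms
    until the optimism has worn off.
    """
    Q = np.full(k, q0)               # optimistic initial estimates
    for _ in range(n_steps):
        A = int(np.argmax(Q))        # purely greedy selection
        R = reward_fn(A)
        Q[A] += alpha * (R - Q[A])   # constant step-size update
    return Q
```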
In upper confidence bound (UCB) action selection, actions are chosen as
$$ A_t = \operatorname{argmax}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right] $$
The square-root term is a measure of the uncertainty or variance in the estimate of $a$'s value, and $c > 0$ controls the degree of exploration.
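A sketch of UCB action selection (the default $c = 2$ and trying untried arms first are assumed conventions):

```python
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Upper confidence bound action selection at time step t.

    Actions with N(a) = 0 are treated as maximizing and tried first.
    Afterwards each action gets an uncertainty bonus c*sqrt(ln t / N(a))
    that shrinks the more often the action has been selected.
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```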
Gradient Bandit Algorithm: Update Rules
Each action $a$ is given a numerical preference $H_t(a)$, and actions are selected according to the softmax distribution $\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$.
The preferences are updated as
$$ H_{t+1}(A_t) = H_t(A_t) + \alpha(R_t - \overline{R}_t)(1 - \pi_t(A_t)) $$
and
$$ H_{t+1}(a) = H_t(a) - \alpha(R_t - \overline{R}_t)\,\pi_t(a) \quad \forall a \neq A_t $$
$\overline{R}_t$ is the average of all the rewards up through and including time $t$. It is also called the baseline reward.
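A sketch of the gradient bandit algorithm built from these update rules (softmax policy over the preferences, incremental average as the baseline; `reward_fn` is an assumed environment callable as before):

```python
import numpy as np

def gradient_bandit(reward_fn, k, alpha=0.1, n_steps=1000, seed=0):
    """Gradient bandit: softmax over preferences H(a), baseline = average reward."""
    rng = np.random.default_rng(seed)
    H = np.zeros(k)                # action preferences H_t(a)
    baseline = 0.0                 # running average of all rewards, R_bar_t
    for t in range(1, n_steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                        # softmax policy pi_t(a)
        A = int(rng.choice(k, p=pi))
        R = reward_fn(A)
        baseline += (R - baseline) / t        # average up to and including time t
        # Selected action: preference moves up when R beats the baseline;
        # all other actions move down in proportion to their probability.
        one_hot = np.zeros(k)
        one_hot[A] = 1.0
        H += alpha * (R - baseline) * (one_hot - pi)
    return H, pi
```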
Which is the best?
For the given problem, UCB performs the best.
No general statement should be made.
Different methods may perform better on different problems!