Problem:
Can we use experience instead to solve the MDP?
Monte Carlo Methods: broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.
Previously we computed value functions from knowledge of the MDP; here we learn value functions from sample returns gathered through experience with the MDP.
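As a concrete illustration, here is a minimal sketch of first-visit Monte Carlo prediction under these notes' setting; the `generate_episode(policy)` helper and the environment it samples from are assumptions, not something given in the notes.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi from sample returns alone (no model of the MDP).

    generate_episode(policy) is an assumed helper returning a list of
    (state, action, reward) tuples from one episode under `policy`.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:            # average only first-visit returns
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return V
```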
Problem => Maintaining Exploration
Solution => Assumption of Exploring Starts
Note: This assumption cannot be relied upon in general, particularly when learning directly from actual interaction with an environment. We adopt it for now and later see how to do away with it.
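A minimal sketch of what the exploring-starts assumption means in practice, assuming a hypothetical environment interface with `reset_to(state)` and `step(action)` methods and enumerable `states` and `actions`:

```python
import random

def episode_with_exploring_start(env, policy, states, actions):
    """Begin the episode from a randomly chosen (state, action) pair, so every
    pair has nonzero probability of starting an episode, then follow `policy`.
    env.reset_to, env.step, states, and actions are assumed interfaces."""
    s = random.choice(states)
    a = random.choice(actions)       # exploring start: random first action
    env.reset_to(s)
    episode = []
    done = False
    while not done:
        s_next, r, done = env.step(a)
        episode.append((s, a, r))
        s = s_next
        a = policy(s)                # after the start, actions come from the policy
    return episode
```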
How to do without Infinite Episodes for learning?
How to do without Exploring Starts Assumption?
Policy being learned about is called the target policy,
Policy used to generate behavior is called the behavior policy.
Hence, learning is from data “off” the target policy, and the overall process is termed off-policy learning.
As the data is due to a different policy, off-policy methods are often of greater variance and slower to converge.
On the other hand, off-policy methods are more powerful and general (on-policy methods are the special case in which the target and behavior policies are the same).
They can be applied to learn from data generated by a conventional non-learning controller, or from a human expert.
Given $\pi$ is the target policy, $\mu$ is the behavior policy,
Both policies are considered fixed and given.
The goal is to estimate action values under $\pi$ given episodes generated by following policy $\mu$.
Importance sampling: A general technique for estimating expected values under one distribution given samples from another.
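Concretely, the relative probability of a trajectory's actions under the two policies gives the importance-sampling ratio for returns starting at time $t$ in an episode terminating at time $T$:

$$\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}$$

The state-transition probabilities cancel in this ratio, so it depends only on the two policies and not on the (possibly unknown) dynamics of the MDP.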
The ordinary importance-sampling estimator is unbiased whereas the weighted importance-sampling estimator is biased
The variance of the ordinary importance-sampling estimator is in general unbounded, whereas, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero.
The weighted estimator usually has dramatically lower variance in practice and is strongly preferred.
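Written out, with $\mathcal{T}(s)$ the set of time steps at which $s$ is visited and $T(t)$ the termination time of the episode containing $t$, the two estimators of $v_\pi(s)$ are:

Ordinary importance sampling:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$$

Weighted importance sampling:
$$V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$$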
One issue to watch for: maintaining sufficient exploration