Dynamic: sequential or temporal component to the problem
Programming: optimising a “program”, i.e. a policy
Dynamic Programming is a very general solution method for problems which have two properties:
1) Optimal substructure: the optimal solution can be decomposed into solutions of subproblems.
2) Overlapping subproblems: the same subproblems recur many times, so their solutions can be cached and reused.
#### Travelling salesman problem:
Markov decision processes satisfy both properties: the Bellman equation gives a recursive decomposition, and the value function stores and reuses solutions to subproblems.
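To make the recursive decomposition explicit, the state-value function of a policy $\pi$ satisfies the Bellman expectation equation (written here with the dynamics notation $p(s', r \mid s, a)$ and discount factor $\gamma$, which are assumed conventions rather than definitions from this section): $$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$ The value of a state is expressed through the values of its successor states, which is exactly the subproblem structure that DP exploits.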
In the context of MDPs, dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies, given a model of the environment's dynamics.
We assume that the environment is a finite MDP: its state, action, and reward sets are all finite.
DP is possible in continuous spaces as well, but it is more complex (this will be covered in future chapters).
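As a concrete illustration of what such an algorithm looks like, below is a minimal sketch of iterative policy evaluation for a tabular MDP. The transition-model format `P[s][a] = [(prob, next_state, reward), ...]`, the toy two-state MDP, and the function name are assumptions made for this example, not definitions from the chapter.

```python
import numpy as np

# Hypothetical tabular MDP: P[s][a] is a list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
n_states, n_actions, gamma = 2, 2, 0.9

def policy_evaluation(policy, P, n_states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman expectation backup until the values stop changing."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = sum(
                policy[s][a] * prob * (r + gamma * V[s2])
                for a in P[s]
                for prob, s2, r in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

# Evaluate the uniform random policy on the toy MDP.
uniform = {s: {a: 1.0 / n_actions for a in range(n_actions)} for s in range(n_states)}
print(policy_evaluation(uniform, P, n_states, gamma))
```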
Let $\pi$ and $\pi'$ be any pair of deterministic policies such that, for all $s \in S$, $$q_\pi(s, \pi'(s)) \geq v_{\pi}(s)$$ Then the policy $\pi'$ must be as good as, or better than, $\pi$. That is, it must obtain greater or equal expected return from all states $s \in S$: $$v_{\pi'}(s) \geq v_{\pi}(s)$$
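The intuition behind this policy improvement theorem is the standard argument of repeatedly expanding the left-hand side (the expectations below are over the MDP dynamics while following $\pi'$; this is a proof sketch, not the full derivation): $$v_\pi(s) \leq q_\pi(s, \pi'(s)) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\big] \leq \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big] \leq \cdots \leq v_{\pi'}(s)$$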
Any optimal policy can be subdivided into two components:
1) An optimal first action.
2) An optimal policy from the successor state onwards.
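This decomposition is captured by the Bellman optimality equation (again using the dynamics notation $p(s', r \mid s, a)$ as an assumed convention): $$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]$$ The maximising action is the optimal first step, and $v_*(s')$ is the value of behaving optimally from the successor state onwards.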
Three simple ideas for asynchronous dynamic programming:
1) In-place dynamic programming (sketched below).
2) Prioritised sweeping.
3) Real-time dynamic programming.
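As an illustration of the first idea, here is a minimal sketch of in-place value iteration: a single value array is kept, and each backup immediately uses the latest estimates for the other states. The transition-model format `P[s][a] = [(prob, next_state, reward), ...]` and the function name are assumptions for this example.

```python
import numpy as np

def in_place_value_iteration(P, n_states, gamma=0.9, theta=1e-8):
    """In-place DP: one value array; each sweep uses the freshest estimates."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup using the current, partially updated V.
            best = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # overwrite immediately instead of keeping two arrays
        if delta < theta:
            return V
```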
Two further topics remain:
1) Proof of convergence for value iteration and policy iteration.
2) Examples of asynchronous DP in action.
1) How do we know that value iteration converges to the optimal value function $v_*$?
2) Or that iterative policy evaluation converges to $v_\pi$?
3) Is the solution unique?
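These questions are typically answered via the contraction mapping (Banach fixed-point) theorem. As a brief sketch (the operator $T^\pi$ and the max-norm notation are introduced only for this sketch): the Bellman expectation backup is a $\gamma$-contraction, $$\|T^\pi(u) - T^\pi(v)\|_\infty \leq \gamma \|u - v\|_\infty$$ so repeated application converges to a unique fixed point, which is $v_\pi$; an analogous argument for the Bellman optimality backup shows that value iteration converges to a unique $v_*$.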
Similar convergence questions arise for the asynchronous DP algorithms outlined previously.
We will look at one example of how prioritised sweeping can help improve policy evaluation.
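As a rough preview, a minimal sketch of prioritised sweeping applied to policy evaluation could look like the following: states sit in a priority queue ordered by the magnitude of their Bellman error, the state with the largest error is backed up first, and the priorities of its predecessors are then refreshed. The transition-model format, the `predecessors[s]` map (states with a transition into `s`), and the function names are assumptions for this example.

```python
import heapq
import numpy as np

def prioritized_policy_evaluation(P, policy, predecessors, n_states,
                                  gamma=0.9, theta=1e-8, max_backups=10_000):
    """Back up states in order of Bellman-error magnitude rather than sweeping in order."""

    def bellman_error(s, V):
        backed_up = sum(
            policy[s][a] * prob * (r + gamma * V[s2])
            for a in P[s]
            for prob, s2, r in P[s][a]
        )
        return backed_up - V[s]

    V = np.zeros(n_states)
    # Max-heap via negated priorities: the state with the largest |error| pops first.
    heap = [(-abs(bellman_error(s, V)), s) for s in range(n_states)]
    heapq.heapify(heap)

    for _ in range(max_backups):
        if not heap:
            break
        neg_priority, s = heapq.heappop(heap)
        if -neg_priority < theta:  # largest remaining (possibly stale) error is tiny
            break
        V[s] += bellman_error(s, V)  # apply the Bellman expectation backup to state s
        # Updating V[s] changes the Bellman error of states that can transition into s,
        # so their priorities are pushed again (stale heap entries are tolerated here).
        for p in predecessors[s]:
            heapq.heappush(heap, (-abs(bellman_error(p, V)), p))
    return V
```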