“The future is independent of the past given the present”
Most Markov reward and decision processes are discounted. Why? In an episodic task the return is a sum over a finite number of terms, so it is always well defined; in a continuing task the return is a sum over an infinite number of terms, which can diverge unless future rewards are discounted.
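As a quick aside (not part of the original notes), discounting is exactly what keeps that infinite sum finite: if every reward is bounded, $|R_t| \le R_{\max}$, and $\gamma \in [0, 1)$, then

$$|G_t| = \left| \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \right| \le \sum_{k=0}^{\infty} \gamma^k R_{\max} = \frac{R_{\max}}{1-\gamma},$$

so the return of a continuing task is always well defined.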
We need a single convention and notation that covers both episodic and continuing tasks. How can we do that?
These can be unified by considering episode termination to be the entering of a special absorbing state that transitions only to itself and that generates only rewards of zero. In a state-transition diagram, this absorbing state appears as a node whose only outgoing arc is a self-loop with reward zero.
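With that convention, a single formula (the standard Sutton and Barto convention, added here for reference) covers both kinds of task:

$$G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k,$$

with the understanding that either $T = \infty$ (a continuing task) or $\gamma = 1$ (an undiscounted episodic task), but not both.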
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Given an MDP $M = \left \langle S, A, P, R, \gamma \right \rangle$ and a policy $\pi$, the state and reward sequence obtained by following $\pi$ is a Markov reward process $\left \langle S, P^{\pi}, R^{\pi}, \gamma \right \rangle$, where

$$P_{ss'}^{\pi} = \sum_{a \in A} \pi(a|s) P_{ss'}^{a}$$

$$R_{s}^{\pi} = \sum_{a \in A} \pi(a|s) R_{s}^{a}$$
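As a small illustration (a minimal sketch, not from the notes; the array shapes, sizes, and variable names are my own), averaging over actions like this reduces to a couple of weighted sums:

```python
import numpy as np

# Illustrative sizes and random numbers: placeholders, not part of the notes.
n_states, n_actions = 3, 2

# P[a, s, s'] : transition probability P_{ss'}^a (each row over s' sums to 1).
P = np.random.dirichlet(np.ones(n_states), size=(n_actions, n_states))
# R[s, a]     : expected immediate reward R_s^a.
R = np.random.randn(n_states, n_actions)
# pi[s, a]    : policy pi(a|s) (each row over a sums to 1).
pi = np.random.dirichlet(np.ones(n_actions), size=n_states)

# Average over actions to get the Markov reward process induced by the policy.
P_pi = np.einsum('sa,ast->st', pi, P)   # P_{ss'}^pi = sum_a pi(a|s) P_{ss'}^a
R_pi = np.einsum('sa,sa->s', pi, R)     # R_s^pi     = sum_a pi(a|s) R_s^a

assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row is still a distribution
```

Once `P_pi` and `R_pi` are available, the MDP under a fixed policy can be treated exactly like a Markov reward process, for example when evaluating the policy.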
Task:
Collect empty soda cans around an office.
Sensors and actuators:
1) Detector : for detecting cans
2) Arm + Gripper : to pick up a can and place it in the onboard bin
We first need to identify the states (S), actions (A), and rewards (R); a small code sketch of the resulting MDP follows the lists below.
Actions:
1) {Search} - actively search for a can
2) {Wait} - remain stationary and wait for someone to bring a can (drains less battery than searching)
3) {Recharge} - head back to the home base to recharge the battery
States:
1) high - the battery is well charged
2) low - the battery charge is low
Rewards:
1) zero most of the time,
2) positive when the robot secures an empty can,
3) negative if the battery runs all the way down
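To make the example concrete, here is a minimal sketch of how this MDP could be encoded as a lookup table and sampled. All probability and reward values (`ALPHA`, `BETA`, `R_SEARCH`, `R_WAIT`, `R_DEPLETED`) are made-up placeholders, and the exact transition structure (e.g. a depleted battery being rescued back to the high state with a penalty) is an assumption, not something the notes above specify.

```python
import random

# Minimal sketch of the can-collecting robot as an MDP lookup table.
HIGH, LOW = "high", "low"
SEARCH, WAIT, RECHARGE = "search", "wait", "recharge"

ALPHA = 0.8        # assumed P(battery stays high | search while high)
BETA = 0.6         # assumed P(battery stays low  | search while low)
R_SEARCH = 2.0     # assumed expected reward while searching (cans found)
R_WAIT = 1.0       # assumed expected reward while waiting
R_DEPLETED = -3.0  # assumed penalty when the battery runs all the way down

# transitions[(state, action)] -> list of (probability, next_state, reward)
transitions = {
    (HIGH, SEARCH):  [(ALPHA, HIGH, R_SEARCH), (1 - ALPHA, LOW, R_SEARCH)],
    (HIGH, WAIT):    [(1.0, HIGH, R_WAIT)],
    (LOW, SEARCH):   [(BETA, LOW, R_SEARCH), (1 - BETA, HIGH, R_DEPLETED)],
    (LOW, WAIT):     [(1.0, LOW, R_WAIT)],
    (LOW, RECHARGE): [(1.0, HIGH, 0.0)],  # recharging assumed relevant only when low
}

def step(state, action):
    """Sample a (next_state, reward) pair from the table above."""
    outcomes = transitions[(state, action)]
    weights = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=weights, k=1)[0]
    return next_state, reward

print(step(HIGH, SEARCH))  # e.g. ('high', 2.0)
```

Writing the dynamics as a table like this makes the Markov property explicit: the distribution over the next state and reward depends only on the current state and the chosen action.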