1) Introduction
2) Model Based RL
3) Dyna: Integrating Planning, Acting, and Learning
4) Prioritized Sweeping
5) Planning as a Part of Action Selection (MCTS)
In this chapter we will learn how to learn a model directly from experience with the environment, and how to use planning with that model to construct a value function or policy.
By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
Some models produce a description of all possibilities and their probabilities; these we call distribution models.
Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.
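To make the distinction concrete, here is a minimal sketch of the same one-step dynamics exposed first as a distribution model and then as a sample model. The class names and the toy dynamics table are illustrative assumptions, not from the text.

```python
import random

class DistributionModel:
    """Returns every possible next state/reward together with its probability."""
    def __init__(self, dynamics):
        # dynamics: {(state, action): [(prob, next_state, reward), ...]}
        self.dynamics = dynamics

    def all_outcomes(self, state, action):
        return self.dynamics[(state, action)]

class SampleModel:
    """Returns a single next state/reward drawn according to those probabilities."""
    def __init__(self, dynamics):
        self.dynamics = dynamics

    def sample(self, state, action):
        probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in self.dynamics[(state, action)]])
        return random.choices(outcomes, weights=probs, k=1)[0]

dynamics = {("s0", "a"): [(0.8, "s1", 1.0), (0.2, "s2", 0.0)]}
print(DistributionModel(dynamics).all_outcomes("s0", "a"))  # full distribution
print(SampleModel(dynamics).sample("s0", "a"))              # one sampled outcome
```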
We use the term Planning to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.
State-space planning: planning is viewed primarily as a search through the state space for an optimal policy or path to a goal.
Plan-space planning: planning is instead viewed as a search through the space of plans. Plan-space planning includes evolutionary methods and "partial-order planning".
State-space planning methods rest on two fundamental ideas:
1) All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy, and
2) They compute their value functions by backup operations applied to simulated experience.
For example: Dynamic Programming.
The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment.
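One way to make these two ideas concrete is random-sample one-step tabular Q-planning: value backups applied to transitions simulated by a sample model rather than to real environment experience. The sketch below assumes a tiny illustrative MDP and a hand-written `sample_model` standing in for a learned model.

```python
import random
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1
STATES, ACTIONS = ["s0", "s1"], ["left", "right"]

def sample_model(s, a):
    # Hypothetical deterministic dynamics standing in for a learned sample model.
    if s == "s0" and a == "right":
        return "s1", 1.0
    return "s0", 0.0

Q = defaultdict(float)
for _ in range(5000):
    # 1) pick a state-action pair at random
    s, a = random.choice(STATES), random.choice(ACTIONS)
    # 2) query the model for a simulated transition
    s2, r = sample_model(s, a)
    # 3) apply a one-step Q-learning backup to the simulated experience
    target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

print({k: round(v, 2) for k, v in Q.items()})
```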
Goal: estimate model $M_\eta$ from experience $\{S_1, A_1, R_2, ..., S_T \}$
Pick a loss function, e.g. mean-squared error, KL divergence, ...
Find parameters $\eta$ that minimise the empirical loss
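As a concrete illustration, here is a minimal sketch of fitting the parameters $\eta$ of a linear expectation model $s_{t+1} \approx \eta^\top [s_t; a_t]$ by minimising the empirical mean-squared error. The synthetic data, dimensions, and variable names are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 3))            # states s_t
A = rng.normal(size=(100, 1))            # actions a_t
true_eta = rng.normal(size=(4, 3))
X = np.hstack([S, A])                    # features [s_t, a_t]
S_next = X @ true_eta + 0.01 * rng.normal(size=(100, 3))  # targets s_{t+1}

# Closed-form least-squares solution: minimise ||X eta - S_next||^2
eta_hat, *_ = np.linalg.lstsq(X, S_next, rcond=None)
mse = np.mean((X @ eta_hat - S_next) ** 2)
print("empirical MSE:", mse)
```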
Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
Count visits $N(s, a)$ to each state-action pair
#### $$ \hat{P}^a_{s,s'} = \frac{1}{N(s, a)} \sum^{T}_{t=1}{\textbf{1}(S_t, A_t, S_{t+1} = s, a, s')} $$
#### $$ \hat{R}^a_{s} = \frac{1}{N(s, a)} \sum^T_{t=1}{\textbf{1}(S_t, A_t = s, a)\,R_{t+1}} $$
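A minimal sketch of these table-lookup estimates: count $N(s,a)$ and average over observed transitions. The episode data below is purely illustrative.

```python
from collections import defaultdict

N = defaultdict(int)                        # N(s, a)
P = defaultdict(lambda: defaultdict(int))   # counts of (s, a) -> s'
R = defaultdict(float)                      # summed rewards for (s, a)

# experience as (S_t, A_t, R_{t+1}, S_{t+1}) tuples
episode = [("s0", "a", 1.0, "s1"), ("s1", "a", 0.0, "s0"), ("s0", "a", 1.0, "s1")]
for s, a, r, s2 in episode:
    N[(s, a)] += 1
    P[(s, a)][s2] += 1
    R[(s, a)] += r

def P_hat(s, a, s2):
    return P[(s, a)][s2] / N[(s, a)]

def R_hat(s, a):
    return R[(s, a)] / N[(s, a)]

print(P_hat("s0", "a", "s1"), R_hat("s0", "a"))
```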
Simulate episodes of experience from 'now' with the model:
### $$ \{ S_t^k, A^k_t, R^k_{t+1}, \ldots, S^k_T \}^K_{k=1} \sim M_\eta $$
Apply model-free RL to simulated episodes
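A minimal sketch of this sample-based planning loop: roll out episodes from a learned table-lookup model (here hand-specified for brevity) and run ordinary Q-learning on the simulated transitions. The toy model, rollout length, and hyperparameters are assumptions.

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
ACTIONS = ["a0", "a1"]

# Hypothetical learned model: (s, a) -> list of (prob, next_state, reward)
model = {
    ("s0", "a0"): [(1.0, "s0", 0.0)],
    ("s0", "a1"): [(1.0, "s1", 1.0)],
    ("s1", "a0"): [(1.0, "s0", 0.0)],
    ("s1", "a1"): [(1.0, "s1", 0.5)],
}

Q = defaultdict(float)

def sample(s, a):
    probs, outcomes = zip(*[(p, (s2, r)) for p, s2, r in model[(s, a)]])
    return random.choices(outcomes, weights=probs)[0]

for _ in range(2000):                      # K simulated episodes from "now"
    s = "s0"
    for _ in range(10):                    # fixed-length rollout
        a = random.choice(ACTIONS) if random.random() < EPS else \
            max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r = sample(s, a)               # simulated transition from the model
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print({k: round(v, 2) for k, v in Q.items()})
```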
1) Selection: A node in the current tree is selected by some informed means as the most promising node from which to explore further.
2) Expansion: The tree is expanded from the selected node by adding one or more nodes as its children.
3) Simulation: From one of these leaf nodes, or another node having an unvisited move, a move is selected as the start of a simulation, or rollout, of a complete game.
4) Backpropagation: The result of the simulated game is backed up to update statistics attached to the links in the partial game tree traversed.
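A minimal sketch of these four phases on a toy take-away game (players alternately remove 1 or 2 stones; whoever takes the last stone wins). The game, the UCT constant, and all names are illustrative assumptions, not a definitive MCTS implementation.

```python
import math
import random

ACTIONS = (1, 2)

class Node:
    def __init__(self, stones, player, parent=None):
        self.stones, self.player = stones, player      # player to move
        self.parent, self.children = parent, {}        # action -> Node
        self.visits, self.wins = 0, 0.0                # statistics stored on the node

    def untried(self):
        return [a for a in ACTIONS if a <= self.stones and a not in self.children]

def uct_select(node, c=1.4):
    # 1) Selection: descend via UCT while nodes are fully expanded
    return max(node.children.values(),
               key=lambda ch: ch.wins / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones, player):
    # 3) Simulation: play random moves to the end and return the winner
    while stones > 0:
        stones -= random.choice([a for a in ACTIONS if a <= stones])
        if stones == 0:
            return player
        player = 1 - player
    return player

def mcts(root_stones, root_player, iterations=2000):
    root = Node(root_stones, root_player)
    for _ in range(iterations):
        node = root
        while not node.untried() and node.children:     # selection
            node = uct_select(node)
        if node.untried():                              # 2) Expansion: add one child
            a = random.choice(node.untried())
            child = Node(node.stones - a, 1 - node.player, parent=node)
            node.children[a] = child
            node = child
        if node.stones == 0:
            winner = 1 - node.player    # previous player took the last stone
        else:
            winner = rollout(node.stones, node.player)
        while node is not None:                          # 4) Backpropagation
            node.visits += 1
            # wins are credited from the perspective of the player who just moved
            node.wins += 1.0 if winner == 1 - node.player else 0.0
            node = node.parent
    # recommend the most-visited root action
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("best move from 5 stones:", mcts(5, 0))  # optimal play takes 2, leaving 3
```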