1) Introduction
2) Model-Based RL
3) Dyna: Integrating Planning, Acting, and Learning
4) Prioritized Sweeping
5) Planning as Part of Action Selection (MCTS)
In this chapter we will learn how an agent can learn a model directly from its experience of the environment and use planning to construct a value function and policy.
By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.
Some models produce a description of all possibilities and their probabilities; these we call distribution models.
Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.
We use the term Planning to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.
State-space planning: planning is viewed primarily as a search through the state space for an optimal policy or a path to a goal.
Plan-space planning: planning is instead viewed as a search through the space of plans. Plan-space planning includes evolutionary methods and "partial-order planning".
In state-space planning there are two fundamental ideas:
1) All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy, and
2) They compute their value functions by backup operations applied to simulated experience.
For example, Dynamic Programming methods clearly fit this structure.
The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment.
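To make this shared structure concrete, here is a minimal sketch of random-sample one-step tabular Q-planning in Python; the `sample_model(s, a) -> (next_state, reward)` interface and the hyperparameters are illustrative assumptions, not fixed by the text. The model supplies simulated transitions, and an ordinary value backup is applied to them.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates, alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning: apply Q-learning
    backups to transitions drawn from a model rather than the environment."""
    Q = defaultdict(float)                      # Q[(s, a)] -> action-value estimate
    for _ in range(n_updates):
        s = random.choice(states)               # 1. pick a state and action at random
        a = random.choice(actions)
        s_next, r = sample_model(s, a)          # 2. query the sample model
        best_next = max(Q[(s_next, b)] for b in actions)
        # 3. one-step Q-learning backup applied to the simulated transition
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```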
Goal: estimate a model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$
Pick a loss function, e.g. mean-squared error, KL divergence, ...
Find parameters $\eta$ that minimise the empirical loss (a fitting sketch for one model class follows the examples below).
Examples of model classes include:
Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
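As one concrete instance of this loss-minimisation view, here is a minimal sketch of fitting a linear expectation model by least squares; it assumes states and (one-hot) actions are already given as feature vectors, and all function names and shapes are illustrative.

```python
import numpy as np

def fit_linear_expectation_model(S, A, R, S_next):
    """Fit a linear expectation model under a mean-squared-error loss:
    expected next state and reward are linear in the (state, action) features."""
    X = np.hstack([S, A])                        # inputs:  (T, d_s + d_a)
    Y = np.hstack([S_next, R.reshape(-1, 1)])    # targets: (T, d_s + 1)
    eta, *_ = np.linalg.lstsq(X, Y, rcond=None)  # eta minimises ||X @ eta - Y||^2
    return eta

def predict(eta, s, a):
    """Expected next state and expected reward under the fitted model."""
    y = np.concatenate([s, a]) @ eta
    return y[:-1], y[-1]
```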
For the Table Lookup Model, count visits $N(s,a)$ to each state-action pair and estimate

$$\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}\big(S_t, A_t, S_{t+1} = s, a, s'\big)$$

$$\hat{\mathcal{R}}^a_s = \frac{1}{N(s,a)} \sum_{t=1}^{T} \mathbf{1}\big(S_t, A_t = s, a\big)\, R_t$$
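A minimal sketch of these table-lookup estimates in Python, assuming experience arrives as `(s, a, r, s_next)` tuples; the class and method names are illustrative.

```python
from collections import defaultdict

class TableLookupModel:
    """Empirical model: counts visits to (s, a) and averages the outcomes."""
    def __init__(self):
        self.visits = defaultdict(int)        # N(s, a)
        self.transitions = defaultdict(int)   # count of (s, a, s')
        self.reward_sum = defaultdict(float)  # sum of rewards observed after (s, a)

    def update(self, s, a, r, s_next):
        self.visits[(s, a)] += 1
        self.transitions[(s, a, s_next)] += 1
        self.reward_sum[(s, a)] += r

    def transition_prob(self, s, a, s_next):
        # \hat{P}^a_{s,s'} = count(s, a, s') / N(s, a)
        return self.transitions[(s, a, s_next)] / self.visits[(s, a)]

    def expected_reward(self, s, a):
        # \hat{R}^a_s = mean reward observed after taking a in s
        return self.reward_sum[(s, a)] / self.visits[(s, a)]
```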
Simulate episodes of experience from 'now' with the model:

$$\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu$$
Apply model-free RL (e.g. Monte-Carlo control, Sarsa, or Q-learning) to the simulated episodes.
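As a minimal sketch, simulated episodes rooted at the current state can be fed to ordinary Q-learning updates. It assumes `Q` is a `defaultdict(float)` over `(state, action)` pairs and that the model exposes a hypothetical `model.sample(s, a)` returning `(next_state, reward, done)`; neither is a fixed API from the text.

```python
import random

def plan_from_now(model, Q, actions, s_now, K, horizon,
                  alpha=0.1, gamma=0.95, eps=0.1):
    """Sample K episodes from the model, starting at the current state,
    and apply model-free Q-learning updates to the simulated experience."""
    for _ in range(K):
        s = s_now
        for _ in range(horizon):
            # epsilon-greedy action selection inside the simulation
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = model.sample(s, a)   # assumed sample-model interface
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            if done:
                break
            s = s_next
    return Q
```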
Each iteration of Monte Carlo Tree Search (MCTS) consists of four steps:
1) Selection: A node in the current tree is selected by some informed means as the most promising node from which to explore further.
2) Expansion: The tree is expanded from the selected node by adding one or more nodes as its children.
3) Simulation: From one of these leaf nodes, or another node having an unvisited move, a move is selected as the start of a simulation, or rollout, of a complete game.
4) Backpropagation: The result of the simulated game is backed up to update statistics attached to the links in the partial game tree traversed.
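The following is a compact sketch of these four phases, not a reference implementation. It assumes a hypothetical sample-model interface (`model.actions(s)`, `model.step(s, a)` returning `(next_state, reward)`, and `model.is_terminal(s)`), uses UCB1 for selection, and a uniform-random rollout policy.

```python
import math
import random

class Node:
    """A node in the partial game tree built by MCTS."""
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action        # action that led from the parent to this node
        self.children = []
        self.untried = None         # actions not yet expanded from this node
        self.visits = 0
        self.total_return = 0.0     # sum of rollout returns backed up through this node

def mcts(model, root_state, n_iterations=1000, c=1.4, gamma=1.0, rollout_len=100):
    root = Node(root_state)
    root.untried = list(model.actions(root_state))
    for _ in range(n_iterations):
        node = root
        # 1) Selection: descend with UCB1 until a node with untried actions (or a leaf).
        while not node.untried and node.children:
            node = max(node.children,
                       key=lambda ch: ch.total_return / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2) Expansion: add one child node for an untried move.
        if node.untried and not model.is_terminal(node.state):
            a = node.untried.pop()
            s_next, _ = model.step(node.state, a)
            child = Node(s_next, parent=node, action=a)
            child.untried = list(model.actions(s_next))
            node.children.append(child)
            node = child
        # 3) Simulation: random rollout from the new node to estimate its value.
        ret, s, discount = 0.0, node.state, 1.0
        for _ in range(rollout_len):
            if model.is_terminal(s):
                break
            s, r = model.step(s, random.choice(model.actions(s)))
            ret += discount * r
            discount *= gamma
        # 4) Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total_return += ret
            node = node.parent
    # Recommend the most-visited action at the root.
    return max(root.children, key=lambda ch: ch.visits).action
```

In a practical game player the rollout policy and the selection rule would be tailored to the domain; this sketch only mirrors the four steps listed above.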