Planning and learning with tabular methods

Reference: Sutton and Barto Chapter 8

Contents

1) Introduction

2) Model Based RL

3) Dyna: Integrating Planning, Acting, and Learning

4) Prioritized Sweeping

5) Planning as Part of Action Selection (MCTS)

Introduction

  • Model-Free RL
    • No model
    • Learn value function (and/or policy) from experience
  • Model-Based RL
    • Learn a model from experience
    • Plan value function (and/or policy) from the model

In this chapter we will see how to learn a model directly from experience with the environment, and how to use planning with that model to construct a value function/policy.

What is a model?

By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.

Some models produce a description of all possibilities and their probabilities; these we call distribution models.

Other models produce just one of the possibilities, sampled according to the probabilities; these we call sample models.

What is planning?

We use the term Planning to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment.

State-space planning: Planning is viewed primarily as a search through the state space for an optimal policy or path to a goal.

Plan-space planning: Planning is instead viewed as a search through the space of plans. Plan-space planning includes evolutionary methods and “partial-order planning”.

In state-space planning, there are two fundamental ideas:

1) All state-space planning methods involve computing value functions as a key intermediate step toward improving the policy, and

2) They compute their value functions by backup operations applied to simulated experience.


For example: Dynamic Programming.

Difference between Planning and Learning:

The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment.

Model Based RL

  • Advantages:
    • Can efficiently learn model by supervised learning methods
    • Can reason about model uncertainty
  • Disadvantages:
    • First learn a model, then construct a value function
    • Two sources of approximation error

The model

  • We will assume state space $S$ and action space $A$ are known.
  • A model $M$ is a representation of an MDP $\{S, A,P, R\}$ parametrized by $\eta$.
  • So a model $M = \{P_\eta,R_\eta\}$, where
    • $P_\eta \approx P$ represents state transitions
    • and $R_\eta \approx R$ represents the rewards.
  • $$ S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t) $$ $$ R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t) $$
  • Typically assume conditional independence between state transitions and rewards:
    $$ P[S_{t+1}, R_{t+1} \mid S_t, A_t] = P[S_{t+1} \mid S_t, A_t] \, P[R_{t+1} \mid S_t, A_t] $$

Model Learning

Goal: estimate model $M_\eta$ from experience $\{S_1, A_1, R_2, ..., S_T \}$

  • This is a supervised learning problem
    • $S_1, A_1 \to R_2, S_2$
    • $S_2, A_2 \to R_3, S_3$
    • $\vdots$
    • $S_{T-1}, A_{T-1} \to R_T, S_T$
  • Learning $s, a \to r$ is a regression problem.
  • Learning $s, a \to s'$ is a density estimation problem.

Pick loss function, e.g. mean-squared error, KL divergence, ...
Find parameters $\eta$ that minimise empirical loss
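
As a worked example (not in the original notes), for a table-lookup parametrization both of these losses have closed-form minimisers. Writing $r_1, \ldots, r_{N(s,a)}$ for the rewards observed after taking $a$ in $s$, and $N(s, a, s')$ for the number of observed transitions $s, a \to s'$:

$$ \hat{R}^a_s = \arg\min_{r} \sum_{i=1}^{N(s,a)} (r_i - r)^2 = \frac{1}{N(s, a)} \sum_{i=1}^{N(s,a)} r_i $$

$$ \hat{P}(\cdot \mid s, a) = \arg\max_{P(\cdot \mid s, a)} \sum_{i=1}^{N(s,a)} \log P(s'_i \mid s, a) \quad \Rightarrow \quad \hat{P}(s' \mid s, a) = \frac{N(s, a, s')}{N(s, a)} $$

i.e. minimising mean-squared error gives the empirical mean reward, and maximising log-likelihood (equivalently, minimising KL divergence to the empirical distribution) gives the empirical transition frequencies. These are exactly the table-lookup estimates in the next section.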

Examples of Models

  • Table Lookup Model

  • Linear Expectation Model

  • Linear Gaussian Model

  • Gaussian Process Model

  • Deep Belief Network Model

Table Lookup Model

  • Model is an explicit MDP, $\hat{P}, \hat{R}$
  • Count visits $N(s, a)$ to each state action pair

    • $$ \hat{P}^a_{s,s'} = \frac{1}{N(s, a)} \sum^{T}_{t=1}{\textbf{1}(S_t, A_t, S_{t+1} = s, a, s')} $$

    • $$ \hat{R}^a_{s} = \frac{1}{N(s, a)} \sum^T_{t=1}{\textbf{1}(S_t, A_t = s, a) R_t} $$

  • Alternatively,
    • At each time-step $t$, record experience tuple
      $\{S_t,A_t, R_{t+1}, S_{t+1}\}$
    • To sample from the model, randomly pick a stored tuple matching $\{s, a, \cdot, \cdot\}$ (see the sketch below)
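
A minimal Python sketch of such a table-lookup model, assuming hashable states and actions (the class and method names are illustrative, not from the notes):

```python
from collections import defaultdict
import random

class TableLookupModel:
    """A minimal count-based table-lookup model (a sketch; names are illustrative)."""

    def __init__(self):
        self.counts = defaultdict(int)              # N(s, a)
        self.transition_counts = defaultdict(int)   # N(s, a, s')
        self.reward_sums = defaultdict(float)       # running sum of rewards for (s, a)
        self.experience = defaultdict(list)         # (s, a) -> list of stored (r, s') tuples

    def update(self, s, a, r, s_next):
        """Record one real transition (S_t, A_t, R_{t+1}, S_{t+1})."""
        self.counts[(s, a)] += 1
        self.transition_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r
        self.experience[(s, a)].append((r, s_next))

    def p_hat(self, s, a, s_next):
        """Estimated transition probability: N(s, a, s') / N(s, a)."""
        return self.transition_counts[(s, a, s_next)] / self.counts[(s, a)]

    def r_hat(self, s, a):
        """Estimated expected reward: mean of observed rewards for (s, a)."""
        return self.reward_sums[(s, a)] / self.counts[(s, a)]

    def sample(self, s, a):
        """Sample-model view: randomly pick a stored tuple matching {s, a, ., .}."""
        return random.choice(self.experience[(s, a)])
```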

Planning with a Model

  • Given a model $M_\eta = \{P_\eta, R_\eta\}$
  • Solve the MDP $\{S, A,P_\eta, R_\eta \}$
  • Using favourite planning algorithm
    • Value iteration (a sketch follows this list)
    • Policy iteration
    • ...
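
For instance, a minimal value-iteration sketch over the learned tabular model; `p_hat(s, a, s2)` and `r_hat(s, a)` are assumed to be defined for every state–action pair (e.g. the bound methods of the table-lookup model above, provided every pair has been visited):

```python
def value_iteration(states, actions, p_hat, r_hat, gamma=0.9, theta=1e-6):
    """Value iteration on the learned MDP {S, A, P_eta, R_eta} (a sketch)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup using the learned model
            v_new = max(
                r_hat(s, a) + gamma * sum(p_hat(s, a, s2) * V[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

Usage might look like `V = value_iteration(states, actions, model.p_hat, model.r_hat)` with the `TableLookupModel` sketched earlier.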

Sample-Based Planning

  • A simple but powerful approach to planning
  • Use the model only to generate samples
  • Sample experience from model $$ S_{t+1} \sim P_\eta(S_{t+1} \mid S_t, A_t) $$ $$ R_{t+1} = R_\eta(R_{t+1} \mid S_t, A_t) $$
  • Apply model-free RL to samples, e.g.:
    • Monte-Carlo control
    • Sarsa
    • Q-learning (sketched below)
  • Sample-based planning methods are often more efficient
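
A minimal sketch of sample-based planning with Q-learning, assuming the `TableLookupModel` sketched earlier as the sample model (all names are illustrative):

```python
import random
from collections import defaultdict

def sample_based_planning(model, actions, n_updates=10_000, alpha=0.1, gamma=0.9):
    """Q-learning applied to experience sampled from a learned sample model (a sketch)."""
    Q = defaultdict(float)
    seen = [sa for sa, tuples in model.experience.items() if tuples]  # visited (s, a) pairs
    for _ in range(n_updates):
        s, a = random.choice(seen)              # choose a previously seen (s, a) to simulate
        r, s_next = model.sample(s, a)          # sample a transition from the model
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```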

Disadvantages:

  • In practice we usually have only an imperfect model $\{P_\eta, R_\eta\} \approx \{P, R\}$
  • Performance of model-based RL is limited to the optimal policy for the approximate MDP $\{S, A, P_\eta, R_\eta\}$
  • i.e. model-based RL is only as good as the estimated model. When the model is inaccurate, the planning process will compute a suboptimal policy.
  • Solution 1: when model is wrong, use model-free RL
  • Solution 2: reason explicitly about model uncertainty

Dyna: Integrating Planning, Acting, and Learning

Real and Simulated Experience

  • We consider two sources of experience
    • Real experience: sampled from the environment (true MDP) $$ S' \sim P^a_{s,s'} $$ $$ R = R^a_s $$
    • Simulated experience: sampled from the model (approximate MDP) $$ S' \sim P_\eta(S' \mid S, A) $$ $$ R = R_\eta(R \mid S, A) $$
  • Real Experience has two roles:
    • Model Learning.
    • Direct Reinforcement learning.

Dyna


Tabular Dyna-Q Algorithm

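
Below is a minimal Python sketch in the spirit of Sutton and Barto's Tabular Dyna-Q algorithm; it assumes a deterministic environment and a hypothetical `env` interface with `reset()` and `step(a)` returning `(s_next, r, done)`:

```python
import random
from collections import defaultdict

def tabular_dyna_q(env, actions, n_episodes=50, n_planning=10,
                   alpha=0.1, gamma=0.95, epsilon=0.1):
    """A sketch of Tabular Dyna-Q; env and its interface are assumptions."""
    Q = defaultdict(float)      # action values Q[(s, a)]
    model = {}                  # learned model: (s, a) -> (r, s'), assumes determinism

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # Direct RL: one-step Q-learning on the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions)
                                  - Q[(s, a)])
            # Model learning: remember what this (s, a) led to
            model[(s, a)] = (r, s_next)
            # Planning: n one-step Q-learning updates on simulated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                        - Q[(ps, pa)])
            s = s_next
    return Q, model
```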

Results



When the model is wrong

  • In the previous section, the model started out empty and was then filled only with exactly correct information.
  • In general, models may be incorrect because:
    • The environment is stochastic.
    • Only a limited number of samples have been observed.
    • The model was learned using function approximation that has generalized imperfectly.
    • The environment has changed and its new behavior has not yet been observed.

Examples



Dyna-Q+

  • The agent keeps track, for each state–action pair, of how many time steps have elapsed since the pair was last tried in a real interaction with the environment.
  • The greater the elapsed time, the greater (we might presume) the chance that the dynamics of this pair have changed and that the model of it is incorrect.
  • To encourage behavior that tests long-untried actions,
    • a special “bonus reward” is given on simulated experiences involving these actions:
    • if the modeled reward is $r$ and the pair has not been tried in $\tau$ time steps, the planning update uses the reward $r + \kappa\sqrt{\tau}$, for some small $\kappa$ (see the sketch below).
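
A minimal sketch of a single Dyna-Q+ planning update with this bonus; the `last_tried` record and the deterministic table model are assumptions for illustration:

```python
import math
import random

def dyna_q_plus_planning_step(Q, model, last_tried, t, actions,
                              alpha=0.1, gamma=0.95, kappa=1e-3):
    """One Dyna-Q+ planning update (a sketch). last_tried[(s, a)] is the last real
    time step at which (s, a) was tried; t is the current time step."""
    (s, a), (r, s_next) = random.choice(list(model.items()))
    tau = t - last_tried[(s, a)]            # time steps since (s, a) was last tried for real
    bonus = kappa * math.sqrt(tau)          # exploration bonus: kappa * sqrt(tau)
    target = (r + bonus) + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```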

Prioritized Sweeping

Introduction

  • If simulated transitions are generated uniformly, then many wasteful backups will be made before stumbling onto one of the few useful ones.
  • In the much larger problems that are our real objective, the number of states is so large that an unfocused search would be extremely inefficient.
  • In general, we want to work back from any state whose value has changed.

'Backward focusing' of planning computations

  • Suppose the agent discovers a change in the environment and changes its estimated value of one state.
  • The only useful one-step backups are those of actions that lead directly into the one state whose value has been changed.
  • Hence we simulate those steps first.
  • As the frontier of useful backups propagates backward, it often grows rapidly.
  • Hence we prioritize the backups according to a measure of their urgency.
  • This is the idea behind prioritized sweeping.

Prioritized sweeping algorithm for a deterministic environment

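
Below is a minimal Python sketch in the spirit of Sutton and Barto's prioritized sweeping algorithm for a deterministic environment; the `env` interface (`reset()`, `step(a)`) and the priority threshold `theta` are assumptions:

```python
import heapq
import itertools
import random
from collections import defaultdict

def prioritized_sweeping(env, actions, n_episodes=50, n_planning=10,
                         alpha=0.5, gamma=0.95, epsilon=0.1, theta=1e-4):
    """A sketch of prioritized sweeping for a deterministic environment."""
    Q = defaultdict(float)
    model = {}                                   # (s, a) -> (r, s')
    predecessors = defaultdict(set)              # s' -> set of (s, a) predicted to lead to s'
    pqueue, counter = [], itertools.count()      # max-priority queue via negated priorities

    def priority(s, a, r, s_next):
        return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

    def push(p, s, a):
        if p > theta:
            heapq.heappush(pqueue, (-p, next(counter), (s, a)))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda b: Q[(s, b)]))
            s_next, r, done = env.step(a)
            model[(s, a)] = (r, s_next)
            predecessors[s_next].add((s, a))
            push(priority(s, a, r, s_next), s, a)
            for _ in range(n_planning):
                if not pqueue:
                    break
                _, _, (ps, pa) = heapq.heappop(pqueue)
                pr, ps_next = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps_next, b)] for b in actions)
                                        - Q[(ps, pa)])
                # Backward focusing: re-check all predecessors of the updated state
                for (qs, qa) in predecessors[ps]:
                    qr, _ = model[(qs, qa)]
                    push(priority(qs, qa, qr, ps), qs, qa)
            s = s_next
    return Q
```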

Results


Rod Maneuvering Task

  • 14,400 possible states (36 angles, 20×20 positions)
  • Actions: AlongRod, PerpRod, RotateL, RotateR
  • Awkwardly placed immovable obstacles.

Planning as Part of Action Selection

Introduction

  • The planning seen till now:
    • The gradual improvement of a policy/value function that is good in all states generally,
    • Not focused on any particular state.
    • E.g. Dyna, DP.
  • A different outlook:
    • Planning is begun and completed after encountering each new state $S_t$.
    • The output is not really a policy, but rather a single decision: the action $A_t$.
    • At the next step, planning begins anew with $S_{t+1}$ to produce $A_{t+1}$, and so on.

Some general features of planning as part of action selection (PPAS)

  • The values and policy are specific to the current state and the choices available there,
  • They are typically discarded after being used to select the current action.
  • Most useful in applications in which fast responses are not required.
  • Forward search paradigm using sample-based planning
  • Simulate episodes of experience from current time step with the model
  • Apply model-free RL to simulated episodes
  • Simulate episodes of experience from 'now' with the model: $$ \{ s_t, A^k_t, R^k_{t+1}, \ldots, S^k_T \}^K_{k=1} \sim M_\eta $$

  • Apply model-free RL to simulated episodes

    • Monte-Carlo control → Monte-Carlo search (a simple sketch follows this list)
    • Sarsa → TD search
  • With a perfect model and an imperfect action-value function, deeper search will usually yield better policies.
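
A minimal sketch of simple (flat) Monte-Carlo search from the current state; `model_step(s, a)`, which samples `(r, s_next, done)` from the learned model, and the fixed `rollout_policy` are assumed interfaces:

```python
import random

def monte_carlo_search(model_step, s_now, actions, rollout_policy,
                       n_simulations=100, horizon=50, gamma=1.0):
    """Simple Monte-Carlo search from the current state (a sketch)."""
    mean_return = {}
    for a0 in actions:                               # evaluate every candidate first action
        total = 0.0
        for _ in range(n_simulations):
            s, a, G, discount = s_now, a0, 0.0, 1.0
            for _ in range(horizon):
                r, s, done = model_step(s, a)        # simulate one step with the model
                G += discount * r
                discount *= gamma
                if done:
                    break
                a = rollout_policy(s)                # continue with the rollout policy
            total += G
        mean_return[a0] = total / n_simulations
    # The output is a single decision: the greedy action for the current state
    return max(mean_return, key=mean_return.get)
```

The computed values are specific to `s_now` and are discarded after the action is chosen; the search is then repeated from the next state.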

Monte Carlo Tree Search

  • One of the recent and successful developments in planning.
  • Broken into two steps:
    • Evaluation
    • Simulation

Evaluation


Simulation


Breakdown of MCTS


Steps:

1) Selection: A node in the current tree is selected by some informed means as the most promising node from which to explore further.

2) Expansion: The tree is expanded from the selected node by adding one or more nodes as its children.

  • Each new child node is now a leaf node of the partial game tree,
  • meaning none of its possible moves has yet been visited in constructing the tree.


3) Simulation: From one of these leaf nodes, or another node having an unvisited move, a move is selected as the start of a simulation, or rollout, of a complete game.

  • Moves are selected by a rollout policy.

4) Backpropagation: The result of the simulated game is backed up to update the statistics attached to the links of the partial game tree that were traversed. (A code sketch covering all four steps follows.)
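
Putting the four steps together, here is a minimal MCTS sketch with UCB selection; `legal_actions(s)`, `model_step(s, a) -> (r, s_next, done)`, and `rollout_policy(s)` are assumed interfaces, not part of the notes:

```python
import math
import random

class Node:
    """A node in the partial game tree."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                       # action -> child Node
        self.visits, self.total_return = 0, 0.0

def mcts(root_state, legal_actions, model_step, rollout_policy,
         n_iterations=1000, c=1.4, horizon=100):
    """A sketch of MCTS; the model and policy interfaces are assumptions."""
    root = Node(root_state)
    for _ in range(n_iterations):
        node, reward_sum, done = root, 0.0, False
        # 1) Selection: descend via UCB while the current node is fully expanded
        while node.children and len(node.children) == len(legal_actions(node.state)):
            parent = node
            node = max(parent.children.values(),
                       key=lambda n: n.total_return / n.visits
                       + c * math.sqrt(math.log(parent.visits) / n.visits))
        # 2) Expansion: add one child for a previously untried move
        untried = [a for a in legal_actions(node.state) if a not in node.children]
        if untried:
            a = random.choice(untried)
            r, s_next, done = model_step(node.state, a)
            child = Node(s_next, parent=node)
            node.children[a] = child
            node, reward_sum = child, r
        else:
            done = True                          # no legal moves: treat as terminal
        # 3) Simulation (rollout): play out a complete game with the rollout policy
        s = node.state
        for _ in range(horizon):
            if done:
                break
            a = rollout_policy(s)
            r, s, done = model_step(s, a)
            reward_sum += r
        # 4) Backpropagation: update statistics along the traversed path
        while node is not None:
            node.visits += 1
            node.total_return += reward_sum
            node = node.parent
    # Planning as part of action selection: return a single decision for root_state
    return max(root.children, key=lambda a: root.children[a].visits)
```

The returned action is played in the real environment, and planning then begins anew from the resulting state.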

Example


Advantages of MCTS

  • Highly selective best-first search
  • Evaluates states dynamically (unlike e.g. DP)
  • Uses sampling to break curse of dimensionality
  • Works for “black-box” models (only requires samples)
  • Computationally efficient, anytime, parallelisable