Policy Gradient Methods

Table of Contents

1) Introduction: Policy Gradient vs the World, Advantages and Disadvantages

2) REINFORCE: Simplest Policy Gradient Method

3) Actor-Critic Methods

4) Additional Enhancements to Actor-Critic Methods

Introduction

Value-Based Vs Policy-Based RL

Why Policy-Based RL?

Can Learning a Policy Be Easier than Learning the Values of States?

  • The policy may be a simpler function to approximate.
  • This is the simplest advantage that policy parameterization may have over action-value parameterization.

Why?

  • Problems vary in the complexity of their policies and action-value functions.
  • For some, the action-value function is simpler and thus easier to approximate.
  • For others, the policy is simpler.

In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy.

Example: robotics tasks with continuous action spaces.

Example of Stochastic Optimal Policy

Why not use a softmax of action-values for stochastic policies?

  • A softmax over action values alone would not allow the policy to approach a deterministic one, even if and when that is required.
  • The action-value estimates converge to their true values, which differ by a finite amount, translating into specific probabilities other than 0 and 1.
  • With a softmax plus a temperature parameter $T$, $T$ could be reduced over time to approach determinism.
  • However, in practice it would be difficult to choose the reduction schedule, or even the initial temperature, without more knowledge of the true action values.
  • In contrast, policy gradient methods are driven to produce the optimal stochastic policy.
  • If the optimal policy is deterministic, the preferences of the optimal actions are driven infinitely higher than those of all suboptimal actions, as the sketch below illustrates.
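To make the contrast concrete, here is a minimal NumPy sketch (the two-armed bandit, its values, the step size, and the helper names are illustrative assumptions, not part of any standard library): a softmax over fixed action-value estimates keeps every action probability strictly between 0 and 1, while a softmax over learned preferences, updated with a policy-gradient-style rule, can drive the policy arbitrarily close to deterministic.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax with an optional temperature parameter."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 2-armed bandit whose true action values differ by a finite amount.
q_values = np.array([1.0, 0.5])

# Softmax over action-value estimates: probabilities stay strictly between
# 0 and 1 unless the temperature is explicitly annealed towards zero.
for T in (1.0, 0.5, 0.1):
    print("T =", T, "->", softmax(q_values, temperature=T))

# Softmax over learned preferences: a policy-gradient-style update can push
# the preference gap without bound, so the policy can approach determinism.
rng = np.random.default_rng(0)
preferences = np.zeros(2)
for _ in range(2000):
    probs = softmax(preferences)
    action = rng.choice(2, p=probs)
    reward = q_values[action]        # deterministic rewards, for simplicity
    baseline = probs @ q_values      # expected reward under the current policy
    grad_log = -probs                # gradient of log pi(action) w.r.t. preferences
    grad_log[action] += 1.0
    preferences += 0.1 * (reward - baseline) * grad_log
print("learned policy:", softmax(preferences))
```

Running this, the first loop prints probabilities bounded away from 0 and 1 for every fixed temperature, while the learned policy should end up close to [1, 0].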

REINFORCE: Simplest Policy Gradient Method

Quality Measure of Policy

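For a parameterized policy $\pi_\theta$, the quality $\eta(\theta)$ is usually measured in one of three standard ways, depending on the setting (notation follows common treatments; $d^{\pi_\theta}$ denotes the stationary distribution of states under $\pi_\theta$, and $\mathcal{R}_s^a$ the expected immediate reward for taking $a$ in $s$):
$$\eta_1(\theta) = V^{\pi_\theta}(s_1) \qquad \text{(start value, for episodic environments)}$$
$$\eta_{\text{avV}}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s) \qquad \text{(average state value)}$$
$$\eta_{\text{avR}}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}_s^a \qquad \text{(average reward per time-step)}$$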

Policy Optimisation

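Whichever measure is used, policy-based RL becomes an optimisation problem: find the parameters $\theta$ that maximise $\eta(\theta)$,
$$\theta^\star = \arg\max_\theta\, \eta(\theta).$$
Gradient-free methods (hill climbing, simplex, genetic algorithms) can be applied directly, but gradient-based methods are often more efficient when a gradient is available, and they are the focus here.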

Gradient Ascent

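Gradient ascent, in its basic form, adjusts the policy parameters in the direction of the gradient of the objective:
$$\Delta\theta = \alpha\, \nabla_\theta \eta(\theta), \qquad \nabla_\theta \eta(\theta) = \Big( \tfrac{\partial \eta(\theta)}{\partial \theta_1}, \ldots, \tfrac{\partial \eta(\theta)}{\partial \theta_n} \Big)^{\top},$$
where $\alpha$ is a step-size parameter.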

Gradient Ascent - FDM

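The finite-difference method (FDM) estimates the gradient numerically, one dimension at a time: for each $k \in \{1, \ldots, n\}$, perturb $\theta$ by a small $\epsilon$ along the unit vector $u_k$ and measure the change in the objective,
$$\frac{\partial \eta(\theta)}{\partial \theta_k} \approx \frac{\eta(\theta + \epsilon u_k) - \eta(\theta)}{\epsilon}.$$
This needs $n$ policy evaluations per gradient estimate; it is simple and works even for non-differentiable policies, but it is noisy and inefficient for large $n$.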

Analytic Gradient Ascent

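Analytic (likelihood-ratio) gradients assume the policy is differentiable wherever it is non-zero and use the identity
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s),$$
so that gradients of expectations under $\pi_\theta$ become expectations of the score function $\nabla_\theta \log \pi_\theta(a \mid s)$, which can be estimated from samples.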

Example: Softmax Policy

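A standard discrete-action example, assuming linear features $\phi(s,a)$: weight each action by $\phi(s,a)^{\top}\theta$ and take a softmax,
$$\pi_\theta(a \mid s) \propto \exp\!\big(\phi(s,a)^{\top}\theta\big),$$
whose score function has the convenient form
$$\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\big[\phi(s,\cdot)\big].$$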

Example: Gaussian Policy

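For continuous action spaces (as in the robotics example above), a common choice is a Gaussian policy with linear mean $\mu(s) = \phi(s)^{\top}\theta$ and fixed (or separately parameterized) variance $\sigma^2$, i.e. $a \sim \mathcal{N}\big(\mu(s), \sigma^2\big)$, whose score function is
$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \mu(s)\big)\,\phi(s)}{\sigma^2}.$$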

One-step MDP

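Consider a one-step MDP: start in a state $s \sim d(s)$, take a single action $a$, receive a reward with mean $\mathcal{R}_s^a$, and terminate. Then
$$\eta(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}_s^a,$$
and, using the likelihood-ratio identity,
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, r\big].$$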

Policy Gradient Theorem

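The policy gradient theorem generalises the one-step result to multi-step MDPs by replacing the instantaneous reward $r$ with the long-term action value $Q^{\pi_\theta}(s,a)$: for any differentiable policy $\pi_\theta$ and any of the objectives above (up to a constant factor that depends on the chosen objective),
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\big].$$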

Policy Gradient Theorem - Proof

Monte-Carlo Policy Gradient (REINFORCE)

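As a concrete reference, here is a minimal, runnable sketch of REINFORCE with a tabular softmax policy. The 3-state corridor environment, the constants, and the helper names (env_step, softmax) are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

# Toy 3-state corridor MDP (illustrative): states 0..2, actions 0=left, 1=right.
# Moving right from state 2 reaches the goal: reward +1 and the episode ends.
# Every other transition costs -0.01.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA = 0.99, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    return (None, 1.0) if nxt == N_STATES else (nxt, -0.01)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((N_STATES, N_ACTIONS))        # tabular softmax policy parameters

for episode in range(500):
    # 1. Generate an episode S0, A0, R1, ... by following pi_theta.
    s, states, actions, rewards = 0, [], [], []
    while s is not None:
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        s, r = env_step(s, a)
        rewards.append(r)
    # 2. Compute the Monte-Carlo return G_t for every time step.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    # 3. REINFORCE update: theta += alpha * gamma^t * G_t * grad log pi(A_t | S_t).
    for t, (s_t, a_t, G_t) in enumerate(zip(states, actions, returns)):
        probs = softmax(theta[s_t])
        grad_log = -probs
        grad_log[a_t] += 1.0                   # d log pi(a_t|s_t) / d theta[s_t]
        theta[s_t] += ALPHA * (GAMMA ** t) * G_t * grad_log

print(softmax(theta[0]))                       # should strongly prefer "right"
```

Using the sampled return $G_t$ in place of $Q^{\pi_\theta}$ is what makes this a Monte-Carlo policy gradient: the updates are unbiased but can have high variance.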

PuckWorld Example

Demo: PuckWorld (DQN) in Reinforce.js

REINFORCE with Baseline

  • REINFORCE has good theoretical convergence properties.
  • The expected update over an episode is in the same direction as the performance gradient.
  • This assures:
    • An improvement in expected performance for sufficiently small $\alpha$, and
    • Convergence to a local optimum under standard stochastic approximation conditions.
  • However,
    • as a Monte Carlo method, REINFORCE may suffer from high variance, and thus
    • may be slow to learn.

Can we reduce the variance somehow?

REINFORCE with Baseline

  • The derivative of the quality $\eta(\theta)$ of the policy network can be written as
    $$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\, G_t\big].$$
  • Instead of using the returns/action values directly, we first compare them with a baseline $b(S_t)$:
    $$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(G_t - b(S_t)\big)\big].$$
  • The baseline can be any function, even a random variable.
  • Only condition: it must not vary with the action $a$.
  • Any guesses why? Because then $\sum_a b(s)\,\nabla_\theta \pi_\theta(a \mid s) = b(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\,\nabla_\theta 1 = 0$, so the baseline adds no bias to the gradient while it can substantially reduce its variance.
  • Finally, the update equation becomes
    $$\theta_{t+1} = \theta_t + \alpha\,\big(G_t - b(S_t)\big)\,\nabla_\theta \log \pi_\theta(A_t \mid S_t).$$

Actor-Critic Methods

Reducing Variance Using a Critic

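The standard remedy is to introduce a critic with parameters $w$ that estimates the action-value function, $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$, and to let the actor follow an approximate policy gradient,
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a).$$
Actor-critic algorithms therefore maintain two sets of parameters: the critic updates $w$ (policy evaluation) and the actor updates $\theta$ in the direction suggested by the critic.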

Estimating the Action-Value Function

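Estimating $Q_w$ is a policy-evaluation problem (how good is the current $\pi_\theta$?), so the familiar tools apply: Monte-Carlo evaluation, TD(0), TD($\lambda$), or least-squares methods. For instance, with a linear critic $Q_w(s,a) = \phi(s,a)^{\top} w$ and TD(0),
$$\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a), \qquad w \leftarrow w + \beta\, \delta\, \phi(s, a),$$
where $\beta$ is the critic's step size.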

Action-Value Actor-Critic

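Below is a minimal sketch of an action-value actor-critic loop: a tabular softmax actor plus a SARSA-style tabular critic, on the same kind of made-up corridor MDP used in the REINFORCE sketch (all names and constants are again illustrative).

```python
import numpy as np

# Same kind of toy corridor MDP as above (illustrative only).
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA_ACTOR, BETA_CRITIC = 0.99, 0.05, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    return (None, 1.0) if nxt == N_STATES else (nxt, -0.01)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: tabular softmax policy
q_w = np.zeros((N_STATES, N_ACTIONS))     # critic: tabular action-value estimate

def sample_action(state):
    return rng.choice(N_ACTIONS, p=softmax(theta[state]))

for episode in range(500):
    s = 0
    a = sample_action(s)
    while s is not None:
        s2, r = env_step(s, a)
        if s2 is None:                        # terminal transition: no bootstrap
            a2, td_target = None, r
        else:                                 # SARSA-style bootstrapped target
            a2 = sample_action(s2)
            td_target = r + GAMMA * q_w[s2, a2]
        # Actor: move theta in the direction grad log pi(a|s) * Q_w(s, a).
        probs = softmax(theta[s])
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += ALPHA_ACTOR * q_w[s, a] * grad_log
        # Critic: TD(0) update of the action-value estimate.
        q_w[s, a] += BETA_CRITIC * (td_target - q_w[s, a])
        s, a = s2, a2

print(softmax(theta[0]))                      # should prefer action 1 (right)
```

Compared with REINFORCE, the update happens at every step from a bootstrapped target instead of waiting for the end of the episode, trading some bias for much lower variance.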

Bias In Actor-Critic Algorithm

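Approximating the policy gradient with a critic $Q_w(s,a)$ can introduce bias, and a biased policy gradient may not find the right solution. Fortunately, if the value-function approximation is chosen carefully, the bias can be avoided entirely; this is the idea behind compatible function approximation, next.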

Compatible Function Approximation

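In its standard form, the compatible function approximation theorem states that if

  • the critic is compatible with the policy, $\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, and
  • the critic's parameters $w$ minimise the mean-squared error $\varepsilon = \mathbb{E}_{\pi_\theta}\big[\big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)^2\big]$,

then the policy gradient computed with the critic is exact:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big].$$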

Compatible Function Approximation - Proof

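Sketch: if $w$ minimises $\varepsilon$, its gradient with respect to $w$ is zero,
$$\mathbb{E}_{\pi_\theta}\big[\big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)\, \nabla_w Q_w(s,a)\big] = 0,$$
and by the compatibility condition $\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, so
$$\mathbb{E}_{\pi_\theta}\big[Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] = \mathbb{E}_{\pi_\theta}\big[Q_w(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big].$$
Hence $Q_w$ can be substituted into the policy gradient theorem without changing the gradient.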

Additional Enhancements to Actor-Critic Methods

Actor-Critic with Baseline

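As with REINFORCE, a baseline $B(s)$ can be subtracted without changing the expected gradient, as long as it does not depend on the action. A particularly good baseline is the state-value function, $B(s) = V^{\pi_\theta}(s)$, which leads to the advantage function
$$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s), \qquad \nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a)\big],$$
which can significantly reduce the variance of the gradient estimate.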

Estimating the Advantage Function

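Two standard routes: (i) learn both $V_v(s) \approx V^{\pi_\theta}(s)$ and $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$ with two critics and set $A(s,a) = Q_w(s,a) - V_v(s)$; or (ii) note that the TD error of the true value function,
$$\delta^{\pi_\theta} = r + \gamma\, V^{\pi_\theta}(s') - V^{\pi_\theta}(s),$$
is an unbiased estimate of the advantage, since $\mathbb{E}_{\pi_\theta}\big[\delta^{\pi_\theta} \mid s, a\big] = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$. In practice the approximate TD error $\delta_v = r + \gamma V_v(s') - V_v(s)$ is used, so only a state-value critic is needed:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \delta^{\pi_\theta}\big].$$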

Critics at Different Time-Scales

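The critic's target can come from different time-scales; for a state-value critic $V_v(s)$ with update $\Delta v = \beta\,\big(\text{target} - V_v(s)\big)\, \nabla_v V_v(s)$ (for linear features, $\nabla_v V_v(s) = \phi(s)$), the target can be

  • the Monte-Carlo return $G_t$ (unbiased, high variance),
  • the TD(0) target $r + \gamma\, V_v(s')$ (biased, low variance), or
  • the $\lambda$-return $G_t^{\lambda}$, interpolating between the two.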

Actors at Different Time-Scales

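The same choice applies to the actor: the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ can be weighted by

  • the Monte-Carlo return minus a baseline, $G_t - V_v(s_t)$ (REINFORCE with baseline, high variance),
  • the one-step TD error, $\delta_t = r_{t+1} + \gamma\, V_v(s_{t+1}) - V_v(s_t)$ (actor-critic, low variance), or
  • a $\lambda$-weighted mixture of these two extremes.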

Policy Gradient with Eligibility Traces

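By analogy with backward-view TD($\lambda$), the actor can keep an eligibility trace over its own parameters, accumulating score vectors and decaying them over time (conventions differ on whether the discount $\gamma$ appears in the decay):
$$e_t = \gamma\lambda\, e_{t-1} + \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \Delta\theta = \alpha\, \delta_t\, e_t,$$
where $\delta_t$ is the critic's TD error. This update can be applied online, at every step, even from incomplete sequences.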

Summary

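To recap, the policy gradient can be written in several closely related forms, differing in what multiplies the score $\nabla_\theta \log \pi_\theta(a \mid s)$:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, G_t\big] \quad \text{(REINFORCE)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big] \quad \text{(Q actor-critic)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A_w(s,a)\big] \quad \text{(advantage actor-critic)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \delta\big] \quad \text{(TD actor-critic)}$$
In every case the critic is estimated by policy evaluation (Monte-Carlo, TD(0), or TD($\lambda$)), and a baseline is used where possible to reduce variance without introducing bias.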