Policy Gradient Methods

Table of Contents

1) Introduction: Policy Gradient vs the World, Advantages and Disadvantages

2) REINFORCE: Simplest Policy Gradient Method

3) Actor-Critic Methods

4) Additional Enhancements to Actor-Critic Methods

Introduction

Value-Based Vs Policy-Based RL

Why Policy-Based RL?

Can Learning a Policy Be Easier than Learning the Values of States?

  • The policy may be a simpler function to approximate.
  • This is the simplest advantage that policy parameterization may have over action-value parameterization.

Why?

  • Problems vary in the complexity of their policies and action-value functions.
  • For some, the action-value function is simpler and thus easier to approximate.
  • For others, the policy is simpler.

In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy.

Example: robotics tasks with continuous action spaces.

Example of Stochastic Optimal Policy

Why not use a softmax of action-values for stochastic policies?

  • A softmax over action values alone would not allow the policy to approach a deterministic one, even if and when that is required.
  • The action-value estimates converge to their true values, which differ by a finite amount, translating into specific probabilities other than 0 and 1.
  • With a softmax plus a temperature parameter $T$, $T$ could be reduced over time to approach determinism.
  • However, in practice it would be difficult to choose the reduction schedule, or even the initial temperature, without more knowledge of the true action values.
  • In contrast, policy gradient methods are driven to produce the optimal stochastic policy.
  • If the optimal policy is deterministic, the preferences of the optimal actions are driven infinitely higher than those of all suboptimal actions, as the sketch below illustrates.
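To make the contrast concrete, here is a minimal NumPy sketch (the two-armed bandit, its values, the step size, and the helper names are illustrative assumptions, not part of any standard library): a softmax over fixed action-value estimates keeps every action probability strictly between 0 and 1, while a softmax over learned preferences, updated with a policy-gradient-style rule, can drive the policy arbitrarily close to deterministic.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Numerically stable softmax with an optional temperature parameter."""
    z = np.asarray(x, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical 2-armed bandit whose true action values differ by a finite amount.
q_values = np.array([1.0, 0.5])

# Softmax over action-value estimates: probabilities stay strictly between
# 0 and 1 unless the temperature is explicitly annealed towards zero.
for T in (1.0, 0.5, 0.1):
    print("T =", T, "->", softmax(q_values, temperature=T))

# Softmax over learned preferences: a policy-gradient-style update can push
# the preference gap without bound, so the policy can approach determinism.
rng = np.random.default_rng(0)
preferences = np.zeros(2)
for _ in range(2000):
    probs = softmax(preferences)
    action = rng.choice(2, p=probs)
    reward = q_values[action]        # deterministic rewards, for simplicity
    baseline = probs @ q_values      # expected reward under the current policy
    grad_log = -probs                # gradient of log pi(action) w.r.t. preferences
    grad_log[action] += 1.0
    preferences += 0.1 * (reward - baseline) * grad_log
print("learned policy:", softmax(preferences))
```

Running this, the first loop prints probabilities bounded away from 0 and 1 for every fixed temperature, while the learned policy should end up close to [1, 0].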

REINFORCE: Simplest Policy Gradient Method

Quality Measure of Policy

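For a parameterized policy $\pi_\theta$, the quality $\eta(\theta)$ is usually measured in one of three standard ways, depending on the setting (notation follows common treatments; $d^{\pi_\theta}$ denotes the stationary distribution of states under $\pi_\theta$, and $\mathcal{R}_s^a$ the expected immediate reward for taking $a$ in $s$):
$$\eta_1(\theta) = V^{\pi_\theta}(s_1) \qquad \text{(start value, for episodic environments)}$$
$$\eta_{\text{avV}}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s) \qquad \text{(average state value)}$$
$$\eta_{\text{avR}}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}_s^a \qquad \text{(average reward per time-step)}$$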

Policy Optimisation

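Whichever measure is used, policy-based RL becomes an optimisation problem: find the parameters $\theta$ that maximise $\eta(\theta)$,
$$\theta^\star = \arg\max_\theta\, \eta(\theta).$$
Gradient-free methods (hill climbing, simplex, genetic algorithms) can be applied directly, but gradient-based methods are often more efficient when a gradient is available, and they are the focus here.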

Gradient Ascent

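Gradient ascent, in its basic form, adjusts the policy parameters in the direction of the gradient of the objective:
$$\Delta\theta = \alpha\, \nabla_\theta \eta(\theta), \qquad \nabla_\theta \eta(\theta) = \Big( \tfrac{\partial \eta(\theta)}{\partial \theta_1}, \ldots, \tfrac{\partial \eta(\theta)}{\partial \theta_n} \Big)^{\top},$$
where $\alpha$ is a step-size parameter.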

Gradient Ascent - FDM

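The finite-difference method (FDM) estimates the gradient numerically, one dimension at a time: for each $k \in \{1, \ldots, n\}$, perturb $\theta$ by a small $\epsilon$ along the unit vector $u_k$ and measure the change in the objective,
$$\frac{\partial \eta(\theta)}{\partial \theta_k} \approx \frac{\eta(\theta + \epsilon u_k) - \eta(\theta)}{\epsilon}.$$
This needs $n$ policy evaluations per gradient estimate; it is simple and works even for non-differentiable policies, but it is noisy and inefficient for large $n$.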

Analytic Gradient Ascent

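Analytic (likelihood-ratio) gradients assume the policy is differentiable wherever it is non-zero and use the identity
$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s),$$
so that gradients of expectations under $\pi_\theta$ become expectations of the score function $\nabla_\theta \log \pi_\theta(a \mid s)$, which can be estimated from samples.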

Example: Softmax Policy

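A standard discrete-action example, assuming linear features $\phi(s,a)$: weight each action by $\phi(s,a)^{\top}\theta$ and take a softmax,
$$\pi_\theta(a \mid s) \propto \exp\!\big(\phi(s,a)^{\top}\theta\big),$$
whose score function has the convenient form
$$\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \mathbb{E}_{\pi_\theta}\big[\phi(s,\cdot)\big].$$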

Example: Gaussian Policy

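For continuous action spaces (as in the robotics example above), a common choice is a Gaussian policy with linear mean $\mu(s) = \phi(s)^{\top}\theta$ and fixed (or separately parameterized) variance $\sigma^2$, i.e. $a \sim \mathcal{N}\big(\mu(s), \sigma^2\big)$, whose score function is
$$\nabla_\theta \log \pi_\theta(a \mid s) = \frac{\big(a - \mu(s)\big)\,\phi(s)}{\sigma^2}.$$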

One-step MDP

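Consider a one-step MDP: start in a state $s \sim d(s)$, take a single action $a$, receive a reward with mean $\mathcal{R}_s^a$, and terminate. Then
$$\eta(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_s d(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}_s^a,$$
and, using the likelihood-ratio identity,
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, r\big].$$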

Policy Gradient Theorem

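The policy gradient theorem generalises the one-step result to multi-step MDPs by replacing the instantaneous reward $r$ with the long-term action value $Q^{\pi_\theta}(s,a)$: for any differentiable policy $\pi_\theta$ and any of the objectives above (up to a constant factor that depends on the chosen objective),
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\big].$$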

Policy Gradient Theorem - Proof

Monte-Carlo Policy Gradient (REINFORCE)

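As a concrete reference, here is a minimal, runnable sketch of REINFORCE with a tabular softmax policy. The 3-state corridor environment, the constants, and the helper names (env_step, softmax) are illustrative assumptions, not part of the algorithm itself.

```python
import numpy as np

# Toy 3-state corridor MDP (illustrative): states 0..2, actions 0=left, 1=right.
# Moving right from state 2 reaches the goal: reward +1 and the episode ends.
# Every other transition costs -0.01.
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA = 0.99, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    return (None, 1.0) if nxt == N_STATES else (nxt, -0.01)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((N_STATES, N_ACTIONS))        # tabular softmax policy parameters

for episode in range(500):
    # 1. Generate an episode S0, A0, R1, ... by following pi_theta.
    s, states, actions, rewards = 0, [], [], []
    while s is not None:
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        s, r = env_step(s, a)
        rewards.append(r)
    # 2. Compute the Monte-Carlo return G_t for every time step.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()
    # 3. REINFORCE update: theta += alpha * gamma^t * G_t * grad log pi(A_t | S_t).
    for t, (s_t, a_t, G_t) in enumerate(zip(states, actions, returns)):
        probs = softmax(theta[s_t])
        grad_log = -probs
        grad_log[a_t] += 1.0                   # d log pi(a_t|s_t) / d theta[s_t]
        theta[s_t] += ALPHA * (GAMMA ** t) * G_t * grad_log

print(softmax(theta[0]))                       # should strongly prefer "right"
```

Using the sampled return $G_t$ in place of $Q^{\pi_\theta}$ is what makes this a Monte-Carlo policy gradient: the updates are unbiased but can have high variance.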

PuckWorld Example

Demo: PuckWorld (DQN) in Reinforce.js

REINFORCE with Baseline

  • REINFORCE has good theoretical convergence properties.
  • The expected update over an episode is in the same direction as the performance gradient.
  • This assures:
    • An improvement in expected performance for sufficiently small $\alpha$, and
    • Convergence to a local optimum under standard stochastic approximation conditions.
  • However,
    • as a Monte Carlo method, REINFORCE may suffer from high variance, and thus
    • may be slow to learn.

Can we reduce the variance somehow?

REINFORCE with Baseline

  • The derivative of the quality $\eta(\theta)$ of the policy network can be written as
    $$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\, G_t\big].$$
  • Instead of using the returns/action values directly, we first compare them with a baseline $b(S_t)$:
    $$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\,\big(G_t - b(S_t)\big)\big].$$
  • The baseline can be any function, even a random variable.
  • Only condition: it must not vary with the action $a$.
  • Any guesses why? Because then $\sum_a b(s)\,\nabla_\theta \pi_\theta(a \mid s) = b(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\,\nabla_\theta 1 = 0$, so the baseline adds no bias to the gradient while it can substantially reduce its variance.
  • Finally, the update equation becomes
    $$\theta_{t+1} = \theta_t + \alpha\,\big(G_t - b(S_t)\big)\,\nabla_\theta \log \pi_\theta(A_t \mid S_t).$$

Actor-Critic Methods

Reducing Variance Using a Critic

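The standard remedy is to introduce a critic with parameters $w$ that estimates the action-value function, $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$, and to let the actor follow an approximate policy gradient,
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a).$$
Actor-critic algorithms therefore maintain two sets of parameters: the critic updates $w$ (policy evaluation) and the actor updates $\theta$ in the direction suggested by the critic.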

Estimating the Action-Value Function

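Estimating $Q_w$ is a policy-evaluation problem (how good is the current $\pi_\theta$?), so the familiar tools apply: Monte-Carlo evaluation, TD(0), TD($\lambda$), or least-squares methods. For instance, with a linear critic $Q_w(s,a) = \phi(s,a)^{\top} w$ and TD(0),
$$\delta = r + \gamma\, Q_w(s', a') - Q_w(s, a), \qquad w \leftarrow w + \beta\, \delta\, \phi(s, a),$$
where $\beta$ is the critic's step size.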

Action-Value Actor-Critic

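Below is a minimal sketch of an action-value actor-critic loop: a tabular softmax actor plus a SARSA-style tabular critic, on the same kind of made-up corridor MDP used in the REINFORCE sketch (all names and constants are again illustrative).

```python
import numpy as np

# Same kind of toy corridor MDP as above (illustrative only).
N_STATES, N_ACTIONS = 3, 2
GAMMA, ALPHA_ACTOR, BETA_CRITIC = 0.99, 0.05, 0.1
rng = np.random.default_rng(0)

def env_step(state, action):
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    return (None, 1.0) if nxt == N_STATES else (nxt, -0.01)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.zeros((N_STATES, N_ACTIONS))   # actor: tabular softmax policy
q_w = np.zeros((N_STATES, N_ACTIONS))     # critic: tabular action-value estimate

def sample_action(state):
    return rng.choice(N_ACTIONS, p=softmax(theta[state]))

for episode in range(500):
    s = 0
    a = sample_action(s)
    while s is not None:
        s2, r = env_step(s, a)
        if s2 is None:                        # terminal transition: no bootstrap
            a2, td_target = None, r
        else:                                 # SARSA-style bootstrapped target
            a2 = sample_action(s2)
            td_target = r + GAMMA * q_w[s2, a2]
        # Actor: move theta in the direction grad log pi(a|s) * Q_w(s, a).
        probs = softmax(theta[s])
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += ALPHA_ACTOR * q_w[s, a] * grad_log
        # Critic: TD(0) update of the action-value estimate.
        q_w[s, a] += BETA_CRITIC * (td_target - q_w[s, a])
        s, a = s2, a2

print(softmax(theta[0]))                      # should prefer action 1 (right)
```

Compared with REINFORCE, the update happens at every step from a bootstrapped target instead of waiting for the end of the episode, trading some bias for much lower variance.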

Bias In Actor-Critic Algorithm

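Approximating the policy gradient with a critic $Q_w(s,a)$ can introduce bias, and a biased policy gradient may not find the right solution. Fortunately, if the value-function approximation is chosen carefully, the bias can be avoided entirely; this is the idea behind compatible function approximation, next.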

Compatible Function Approximation

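In its standard form, the compatible function approximation theorem states that if

  • the critic is compatible with the policy, $\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, and
  • the critic's parameters $w$ minimise the mean-squared error $\varepsilon = \mathbb{E}_{\pi_\theta}\big[\big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)^2\big]$,

then the policy gradient computed with the critic is exact:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big].$$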

Compatible Function Approximation - Proof

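Sketch: if $w$ minimises $\varepsilon$, its gradient with respect to $w$ is zero,
$$\mathbb{E}_{\pi_\theta}\big[\big(Q^{\pi_\theta}(s,a) - Q_w(s,a)\big)\, \nabla_w Q_w(s,a)\big] = 0,$$
and by the compatibility condition $\nabla_w Q_w(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, so
$$\mathbb{E}_{\pi_\theta}\big[Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big] = \mathbb{E}_{\pi_\theta}\big[Q_w(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s)\big].$$
Hence $Q_w$ can be substituted into the policy gradient theorem without changing the gradient.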

Additional Enhancements to Actor-Critic Methods

Actor-Critic with Baseline

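As with REINFORCE, a baseline $B(s)$ can be subtracted without changing the expected gradient, as long as it does not depend on the action. A particularly good baseline is the state-value function, $B(s) = V^{\pi_\theta}(s)$, which leads to the advantage function
$$A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s), \qquad \nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s,a)\big],$$
which can significantly reduce the variance of the gradient estimate.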

Estimating the Advantage Function

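Two standard routes: (i) learn both $V_v(s) \approx V^{\pi_\theta}(s)$ and $Q_w(s,a) \approx Q^{\pi_\theta}(s,a)$ with two critics and set $A(s,a) = Q_w(s,a) - V_v(s)$; or (ii) note that the TD error of the true value function,
$$\delta^{\pi_\theta} = r + \gamma\, V^{\pi_\theta}(s') - V^{\pi_\theta}(s),$$
is an unbiased estimate of the advantage, since $\mathbb{E}_{\pi_\theta}\big[\delta^{\pi_\theta} \mid s, a\big] = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s,a)$. In practice the approximate TD error $\delta_v = r + \gamma V_v(s') - V_v(s)$ is used, so only a state-value critic is needed:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \delta^{\pi_\theta}\big].$$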

Critics at Different Time-Scales

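The critic's target can come from different time-scales; for a state-value critic $V_v(s)$ with update $\Delta v = \beta\,\big(\text{target} - V_v(s)\big)\, \nabla_v V_v(s)$ (for linear features, $\nabla_v V_v(s) = \phi(s)$), the target can be

  • the Monte-Carlo return $G_t$ (unbiased, high variance),
  • the TD(0) target $r + \gamma\, V_v(s')$ (biased, low variance), or
  • the $\lambda$-return $G_t^{\lambda}$, interpolating between the two.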

Actors at Different Time-Scales

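The same choice applies to the actor: the score $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ can be weighted by

  • the Monte-Carlo return minus a baseline, $G_t - V_v(s_t)$ (REINFORCE with baseline, high variance),
  • the one-step TD error, $\delta_t = r_{t+1} + \gamma\, V_v(s_{t+1}) - V_v(s_t)$ (actor-critic, low variance), or
  • a $\lambda$-weighted mixture of these two extremes.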

Policy Gradient with Eligibility Traces

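By analogy with backward-view TD($\lambda$), the actor can keep an eligibility trace over its own parameters, accumulating score vectors and decaying them over time (conventions differ on whether the discount $\gamma$ appears in the decay):
$$e_t = \gamma\lambda\, e_{t-1} + \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \Delta\theta = \alpha\, \delta_t\, e_t,$$
where $\delta_t$ is the critic's TD error. This update can be applied online, at every step, even from incomplete sequences.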

Summary

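To recap, the policy gradient can be written in several closely related forms, differing in what multiplies the score $\nabla_\theta \log \pi_\theta(a \mid s)$:
$$\nabla_\theta \eta(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, G_t\big] \quad \text{(REINFORCE)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s,a)\big] \quad \text{(Q actor-critic)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, A_w(s,a)\big] \quad \text{(advantage actor-critic)}$$
$$\nabla_\theta \eta(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, \delta\big] \quad \text{(TD actor-critic)}$$
In every case the critic is estimated by policy evaluation (Monte-Carlo, TD(0), or TD($\lambda$)), and a baseline is used where possible to reduce variance without introducing bias.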