1) Introduction: Policy Gradient vs the World, Advantages and Disadvantages
2) REINFORCE: Simplest Policy Gradient Method
3) Actor-Critic Methods
4) Additional Enhancements to Actor-Critic Methods
Why?
In the latter case, a policy-based method will typically learn faster and yield a superior asymptotic policy.
Example: robotics tasks with a continuous action space.
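To illustrate why continuous actions are handled naturally by policy-based methods, here is a minimal sketch (the names `sample_action`, `W`, `sigma`, and the placeholder return `G` are hypothetical, not from the lecture) of a linear-Gaussian policy: actions are sampled directly from the parameterized distribution, so no maximization over actions is ever needed, and the score-function gradient drives a REINFORCE-style update.

```python
# Sketch only: a linear-Gaussian policy pi(a|s) = N(a; W s, sigma^2 I)
# for a continuous action space, updated with a REINFORCE-style step.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, sigma = 4, 2, 0.5
W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # policy parameters

def sample_action(state):
    """Sample a continuous action and return it with its score-function gradient."""
    mean = W @ state
    action = mean + sigma * rng.standard_normal(action_dim)
    # grad of log N(a; W s, sigma^2 I) with respect to W (outer-product form)
    grad_log_pi = np.outer((action - mean) / sigma**2, state)
    return action, grad_log_pi

state = rng.standard_normal(state_dim)
action, grad_log_pi = sample_action(state)
G, lr = 1.0, 1e-2          # placeholder episode return and learning rate
W += lr * G * grad_log_pi  # ascend the policy-gradient estimate
```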
Can we reduce the variance of the REINFORCE gradient estimate somehow?
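One standard remedy (sketched here as a worked equation; the particular choice of baseline is an assumption, not fixed by the outline above) is to subtract a state-dependent baseline b(s_t) from the return G_t. Because the score function has zero mean under the policy, the estimator stays unbiased while its variance can shrink:

\[
\nabla_\theta J(\theta)
= \mathbb{E}_{\pi_\theta}\!\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(G_t - b(s_t)\big)\Big],
\qquad
\mathbb{E}_{a_t \sim \pi_\theta}\!\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big]
= b(s_t)\,\nabla_\theta \sum_{a} \pi_\theta(a \mid s_t) = 0 .
\]

Choosing b(s_t) to approximate the state value V(s_t) turns the scaling factor into an advantage estimate, which is exactly the step toward the actor-critic methods of items 3 and 4.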