Introduction to Reinforcement Learning

Reference: Chapter 1, Sutton and Barto

Contents

  • RL: Formal Definition
  • RL vs Supervised Learning vs Unsupervised Learning
  • Important RL Perspectives
    • Goal (Reward Hypothesis)
    • Sequential Decision Making Problem
    • Interaction between Agent and Environment
  • Components of RL Agent
  • RL Problems: Learning and Planning
  • Prediction and Control

RL Formal Definition

"Reinforcement learning is the problem of getting an agent to act in the world so as to maximize its rewards. For example, consider teaching a dog a new trick: you cannot tell it what to do, but you can reward/punish it if it does the right/wrong thing. It has to figure out what it did that made it get the reward/punishment, which is known as the credit assignment problem."

RL vs Supervised Learning vs Unsupervised Learning

Supervised Learning

1) A human builds a classifier based on input and output data

2) That classifier is trained with a training set of data

3) That classifier is tested with a test set of data

4) Deployment if the output is satisfactory

To be used when, "I know how to classify this data, I just need you(the classifier) to sort it."

Point of method: To predict class labels or to produce real numbers

Unsupervised Learning

1) A human builds an algorithm based on input data

2) That algorithm is tested with a test set of data (in which the algorithm creates the classifier)

3) Deployment if the classifier is satisfactory

To be used when, "I have no idea how to classify this data, can you(the algorithm) create a classifier for me?" Point of method: To class labels or to predict

Reinforcement Learning

1) A human builds an algorithm based on input data

2) That algorithm presents a state dependent on the input data; a user rewards or punishes the algorithm for the action it took, and this continues over time

3) The algorithm learns from the reward/punishment and updates itself; this continues

4) It is always in production: it needs to learn from real data in order to map states to actions

To be used when, "I have no idea how to classify this data, can you classify this data and I'll give you a reward if it's correct or I'll punish you if it's not."

RL vs Supervised Learning

  • Training Examples
    • Supervised Learning: Training examples of the form (situation together with a label) provided by a knowledgeable external supervisor.
    • RL: No such training examples.
  • Objective Functions
    • Supervised Learning: The aim is to extrapolate, or generalise, so that the system acts correctly in situations not present in the training set.
    • RL: It is often impractical to obtain examples of desired behaviour that are both correct and representative of all situations; the agent must be able to learn from its own experience.

RL vs Unsupervised Learning

  • Unsupervised Learning is about finding structure hidden in collections of unlabeled data.
  • Uncovering structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning agent’s problem of maximizing a reward signal.

Examples of RL

  • Fly stunt manoeuvres in a helicopter
  • Defeat the world champion at Backgammon
  • Manage an investment portfolio
  • Control a power station
  • Make a humanoid robot walk
  • Play many different Atari games better than humans

Important RL Perspectives

Goal of RL (Reward Hypothesis)

  • A reward $R_t$ is a scalar feedback signal that indicates how well the agent is doing at step t
  • The agent's job is to maximise cumulative reward
  • Reward Hypothesis: all goals can be described by the maximisation of expected cumulative reward

Reward Examples

  • Fly stunt manoeuvres in a helicopter
    • +ve reward for following desired trajectory
    • −ve reward for crashing
  • Defeat the world champion at Backgammon
    • +/−ve reward for winning/losing a game
  • Manage an investment portfolio
    • +ve reward for each dollar in bank
  • Control a power station
    • +ve reward for producing power
    • −ve reward for exceeding safety thresholds
  • Make a humanoid robot walk
    • +ve reward for forward motion
    • −ve reward for falling over
  • Play many different Atari games better than humans
    • +/−ve reward for increasing/decreasing score

Sequential Decision Making Problem

  • Goal: select actions to maximise total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
  • Examples:
    • A financial investment (may take months to mature)
    • Refuelling a helicopter (might prevent a crash in several hours)
    • Blocking opponent moves (might help winning chances many moves from now)
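
The point about sacrificing immediate reward is easy to see numerically. The sketch below compares the discounted return of a "greedy" reward sequence against a "patient" one; the reward values and the discount factor γ = 0.9 are illustrative assumptions.

```python
# Illustrative only: two hypothetical reward sequences with discount factor gamma = 0.9.
# "Greedy" takes a large reward now; "patient" sacrifices immediate reward for more later.
gamma = 0.9

greedy_rewards = [10, 0, 0, 0, 0]    # grab reward immediately
patient_rewards = [0, 0, 0, 0, 20]   # delayed but larger reward

def discounted_return(rewards, gamma):
    """Sum of gamma^k * R_{k+1} over the sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return(greedy_rewards, gamma))   # 10.0
print(discounted_return(patient_rewards, gamma))  # 20 * 0.9^4 ≈ 13.12, the better choice
```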

Interaction between Agent and Environment

  • At each time step t, the agent executes action $A_t$, receives observation $O_t$, and receives scalar reward $R_t$
  • The environment receives action $A_t$, emits observation $O_{t+1}$, and emits scalar reward $R_{t+1}$
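
A minimal Python sketch of this interaction loop is shown below. The `Environment` and `Agent` classes and their coin-flip behaviour are hypothetical placeholders, not any particular library's API.

```python
# A minimal sketch of the agent-environment loop described above.
import random

class Environment:
    def step(self, action):
        """Receive action A_t, emit observation O_{t+1} and scalar reward R_{t+1}."""
        observation = random.random()
        reward = 1.0 if action == 1 else 0.0
        return observation, reward

class Agent:
    def act(self, observation, reward):
        """Pick the next action A_t from the latest observation and reward."""
        return random.choice([0, 1])

env, agent = Environment(), Agent()
observation, reward = 0.0, 0.0
for t in range(5):
    action = agent.act(observation, reward)    # agent executes A_t
    observation, reward = env.step(action)     # environment emits O_{t+1}, R_{t+1}
```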

History and State

  • The history is the sequence of observations, actions, and rewards: $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$
  • The state is the information used to determine what happens next: $S_t = f(H_t)$
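
A small sketch of these two definitions, assuming the history is stored as a plain list of (observation, reward, action) tuples and the agent state is simply the last few entries:

```python
# Sketch: the history as a growing sequence of (observation, reward, action) tuples,
# and a state defined as some function f of that history (here: the most recent k items).
history = []          # H_t = O_1, R_1, A_1, ..., O_t, R_t

def agent_state(history, k=1):
    """S_t = f(H_t): here simply the most recent k entries of the history."""
    return tuple(history[-k:])

history.append(("o1", 0.0, "a1"))
history.append(("o2", 1.0, "a2"))
print(agent_state(history))   # (('o2', 1.0, 'a2'),)
```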

Environment State

  • The environment state $S^e_t$ is the environment's private representation, i.e. the data the environment uses to pick the next observation and reward
  • It is not usually visible to the agent

Agent State

  • The agent state $S^a_t$ is the agent's internal representation, i.e. the information the agent uses to pick the next action
  • It can be any function of the history: $S^a_t = f(H_t)$

Information State

  • An information state (a.k.a. Markov state) contains all useful information from the history
  • A state $S_t$ is Markov if and only if $P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]$, i.e. the future is independent of the past given the present

Fully Observable Environment

  • The agent directly observes the environment state: $O_t = S^a_t = S^e_t$
  • Formally, this is a Markov Decision Process (MDP)

Partially Observable Environment

  • The agent indirectly observes the environment, so the agent state is not the same as the environment state
  • Formally, this is a Partially Observable Markov Decision Process (POMDP); the agent must construct its own state representation $S^a_t$, e.g. from the complete history

Major Components of a RL Agent

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function
  • Value function: how good is each state and/or action
  • Model: agent’s representation of the environment

Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action, e.g.
  • Deterministic policy: $a = π(s)$
  • Stochastic policy: $π(a|s) = P[A_t = a|S_t = s]$
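
A minimal sketch of the two policy types above, using toy states and actions chosen purely for illustration:

```python
# Sketch of a deterministic and a stochastic policy over made-up states/actions.
import random

# Deterministic policy: a = π(s), here a plain lookup table.
deterministic_policy = {"s0": "left", "s1": "right"}

# Stochastic policy: π(a|s) = P[A_t = a | S_t = s], here per-state action probabilities.
stochastic_policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(policy, state):
    """Draw an action according to π(·|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s0"])              # always "left"
print(sample_action(stochastic_policy, "s0"))  # "left" with probability 0.8
```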

Value function

  • Value function is a prediction of future reward
  • Used to evaluate the goodness/badness of states
  • And therefore to select between actions, e.g. $$v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^{2} R_{t+3} + ... | S_t = s]$$
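
Since $v_π(s)$ is an expectation, one simple way to approximate it is to average discounted returns over sampled reward sequences. In the sketch below the reward sequences are made up for illustration; in practice they would come from episodes generated by following π from state s.

```python
# Sketch: estimate v_π(s) by averaging discounted returns over sampled reward sequences.
gamma = 0.9

sampled_reward_sequences = [   # illustrative samples, as if generated by following π from s
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]

def discounted_return(rewards, gamma):
    """R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

returns = [discounted_return(seq, gamma) for seq in sampled_reward_sequences]
v_estimate = sum(returns) / len(returns)
print(v_estimate)
```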

Model

  • A model predicts what the environment will do next
  • Transitions: P predicts the next state (i.e. dynamics)
  • Rewards: R predicts the next (immediate) reward, e.g. $$ P^{a}_{ss'} = P[S_{t+1} = s' | S_t = s, A_t = a]$$ $$ R^{a}_{s} = E[R_{t+1} | S_t = s, A_t = a]$$
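
A tabular model can be as simple as two lookup tables, one for transition probabilities and one for expected immediate rewards. The states, actions, and numbers below are illustrative assumptions:

```python
# Sketch: a tabular model of the environment.
# transitions[(s, a)] maps next states s' to P[S_{t+1}=s' | S_t=s, A_t=a];
# rewards[(s, a)] is the expected immediate reward E[R_{t+1} | S_t=s, A_t=a].
transitions = {
    ("s0", "go"): {"s1": 0.9, "s0": 0.1},
    ("s1", "go"): {"s1": 1.0},
}
rewards = {
    ("s0", "go"): 0.0,
    ("s1", "go"): 1.0,
}

def predict(state, action):
    """Use the model to predict the next-state distribution and immediate reward."""
    return transitions[(state, action)], rewards[(state, action)]

print(predict("s0", "go"))   # ({'s1': 0.9, 's0': 0.1}, 0.0)
```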

Maze Example

(figure: maze grid with start and goal)
  • Rewards: −1 per time-step
  • Actions: N, E, S, W
  • States: agent's location

Maze Example: Policy

(figure: maze with policy arrows)
  • Arrows represent policy $\pi(s)$ for each state s.

Maze Example: Value Function

(figure: maze with state values)
  • Numbers represent the value $v_π(s)$ of each state s.

Maze Example: Model

(figure: maze as seen by the agent's model)
  • The grid layout represents the transition model $P^{a}_{ss'}$; the numbers represent the immediate reward $R^{a}_{s}$ from each state s.
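
The sketch below ties the three components together on a tiny corridor "maze". The layout, the −1-per-step reward, and the hand-written policy are illustrative assumptions, not the exact maze from the figures.

```python
# A minimal maze-like sketch combining policy, value function, and model.
states = ["s0", "s1", "s2", "goal"]
policy = {"s0": "right", "s1": "right", "s2": "right"}   # π(s): arrows toward the goal

def model(state, action):
    """Deterministic model: next state and immediate reward (-1 per time-step)."""
    next_state = states[states.index(state) + 1] if action == "right" else state
    return next_state, -1.0

def value(state):
    """v_π(s): total (undiscounted) reward obtained by following π until the goal."""
    if state == "goal":
        return 0.0
    next_state, reward = model(state, policy[state])
    return reward + value(next_state)

for s in states:
    print(s, value(s))   # s0: -3, s1: -2, s2: -1, goal: 0
```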

Learning and Planning

Two fundamental problems in sequential decision making:

  • Reinforcement Learning:
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning:
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search
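
The sketch below contrasts the two settings on a toy two-state problem (all numbers are illustrative). With a known model, the value of a state can be computed purely by deliberation; without one, it has to be estimated from sampled interaction.

```python
# Planning: the model P, R is known, so values can be computed without interaction.
# Learning: the model is unknown, so values are estimated from sampled experience.
import random

gamma = 0.9
P = {"s": {"s": 0.5, "terminal": 0.5}}   # known transition probabilities (planning only)
R = {"s": 1.0}                           # known expected reward (planning only)

# Planning: solve v(s) = R(s) + γ Σ_s' P(s'|s) v(s') by simple iteration, no interaction.
v = {"s": 0.0, "terminal": 0.0}
for _ in range(100):
    v["s"] = R["s"] + gamma * sum(p * v[s2] for s2, p in P["s"].items())

# Learning: interact with the (unknown) environment and average sampled returns instead.
def sample_episode_return():
    total, discount, state = 0.0, 1.0, "s"
    while state != "terminal":
        total += discount * 1.0                        # observed reward
        discount *= gamma
        state = "terminal" if random.random() < 0.5 else "s"
    return total

estimate = sum(sample_episode_return() for _ in range(1000)) / 1000
print(v["s"], estimate)   # both approach the same value (≈ 1.82)
```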

Exploration and Exploitation

  • Reinforcement learning is like trial-and-error learning
  • The agent should discover a good policy,
    • From its experiences of the environment,
    • Without losing too much reward along the way
  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximise reward
  • It is usually important to explore as well as exploit (covered in detail in Chapter 2).
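
A common minimal way to balance the two is ε-greedy action selection, sketched below. The action-value estimates Q and ε = 0.1 are illustrative assumptions; bandit-style algorithms are treated properly in Chapter 2.

```python
# A minimal ε-greedy sketch of the exploration/exploitation trade-off.
import random

Q = {"a0": 1.0, "a1": 0.5, "a2": 0.2}   # current estimates of each action's value
epsilon = 0.1

def epsilon_greedy(Q, epsilon):
    """With probability ε explore (random action); otherwise exploit (best-known action)."""
    if random.random() < epsilon:
        return random.choice(list(Q))    # exploration
    return max(Q, key=Q.get)             # exploitation

counts = {a: 0 for a in Q}
for _ in range(1000):
    counts[epsilon_greedy(Q, epsilon)] += 1
print(counts)   # mostly "a0", with occasional exploratory picks of "a1" and "a2"
```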

Examples of Exploration and Exploitation

  • Restaurant Selection
    • Exploitation: Go to your favourite restaurant
    • Exploration: Try a new restaurant
  • Online Banner Advertisements
    • Exploitation: Show the most successful advert
    • Exploration: Show a different advert
  • Oil Drilling
    • Exploitation: Drill at the best known location
    • Exploration: Drill at a new location
  • Game Playing
    • Exploitation: Play the move you believe is best
    • Exploration: Play an experimental move

Summary

  • We were introduced to the basic terminology of RL.
  • We developed an intuition for how an RL agent can solve problems.