Reinforcement Learning in Computer Vision

Contents

  • We will look at three Computer Vision tasks:

    • Object Detection
    • Action Detection
    • Visual Tracking
  • For each task we will try to answer the following questions:
    • What is the task?
    • Can we identify the RL components?
      • The MDP formulation
        • State Space
        • Action Space
        • Reward System
      • Network Architecture
    • Why use RL for this task?

Task 1: Object Detection


What is the task?

[Figure: object detection example]

Some Important Key Points:

  • This is one of the fundamental Computer Vision tasks
  • So far it has largely been treated as a supervised learning problem
  • Some of the popular approaches:
    • Selective Search
    • CPMC
    • Edge Boxes (based on sliding windows)
    • R-CNN
    • Fast R-CNN
    • Faster R-CNN
    • YOLO
    • ...

What's the idea here?

  • A class-specific active detection model that learns to localize target objects known by the system.
  • Model follows a top-down search strategy, which starts by analyzing the whole scene and then proceeds to narrow down the correct location of objects.
  • Achieved by applying a sequence of transformations to a box that initially covers a large region of the image and is finally reduced to a tight bounding box.

Example Output

[Figure: example localization output]

How to view this as an RL problem?

Question 1: How to think of this as a Sequential Decision Making Problem?

Answer 1: At each time step, the agent should decide in which region of the image to focus its attention so that it can find objects in a few steps.

Question 2: How to cast this as a Markov Decision Process?

Answer 2: We cast the problem of object localization as a Markov decision process (MDP), since this setting provides a formal framework to model an agent that makes a sequence of decisions. We will try to identify the components of the MDP: the set of actions A, the set of states S, and the reward function R.

Identifying MDP Parameters

Action Space?

The set of actions A is composed of eight transformations that can be applied to the box and one action to terminate the search process.

[Figure: the eight box transformations and the trigger action]
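
The concrete transformations are given in the figure; below is a minimal sketch of such an action space, assuming axis translations, scale changes, and aspect-ratio changes, each by a step factor `ALPHA` (the factor value and the exact transformation set are assumptions, not taken from the figure).

```python
# Sketch of a discrete action space: eight box transformations plus a
# terminating "trigger" action. ALPHA and the exact transformations are assumed.
ALPHA = 0.2  # fraction of the current box size moved/scaled per action (assumed)

ACTIONS = ["right", "left", "up", "down",
           "bigger", "smaller", "fatter", "taller", "trigger"]

def apply_action(box, action):
    """Apply one discrete transformation to box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    dw, dh = ALPHA * (x2 - x1), ALPHA * (y2 - y1)
    if action == "right":     x1, x2 = x1 + dw, x2 + dw
    elif action == "left":    x1, x2 = x1 - dw, x2 - dw
    elif action == "down":    y1, y2 = y1 + dh, y2 + dh
    elif action == "up":      y1, y2 = y1 - dh, y2 - dh
    elif action == "bigger":  x1, y1, x2, y2 = x1 - dw, y1 - dh, x2 + dw, y2 + dh
    elif action == "smaller": x1, y1, x2, y2 = x1 + dw, y1 + dh, x2 - dw, y2 - dh
    elif action == "fatter":  y1, y2 = y1 + dh, y2 - dh   # reduce height
    elif action == "taller":  x1, x2 = x1 + dw, x2 - dw   # reduce width
    # "trigger" leaves the box unchanged and terminates the search
    return (x1, y1, x2, y2)
```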

State Space?

  • State representation = tuple (o, h)
  • o = feature vector of the observed region
    • (fc6 output: a 4,096-dimensional feature vector representing its content)
  • h = vector with the history of taken actions
    • The history vector encodes the 10 past actions
    • Each action is encoded as a 9-dimensional binary vector

Motivation behind the history vector: it is useful to stabilize search trajectories that might get stuck in repetitive cycles, improving average precision by approximately 3 percentage points.
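
As a rough sketch of how such a state vector could be assembled (assuming the region feature is a 1-D NumPy array and the history stores the 10 most recent action indices as 9-dimensional one-hot vectors):

```python
import numpy as np

N_ACTIONS, HISTORY_LEN = 9, 10   # 9 possible actions, last 10 actions remembered

def make_state(region_features, past_actions):
    """Build the state s = (o, h) as a single flat vector.

    region_features : 4096-d fc6 feature vector of the observed region (1-D array)
    past_actions    : indices of the most recently taken actions (newest last)
    """
    h = np.zeros((HISTORY_LEN, N_ACTIONS), dtype=np.float32)
    for slot, a in enumerate(past_actions[-HISTORY_LEN:]):
        h[slot, a] = 1.0                                   # one-hot per past action
    return np.concatenate([region_features, h.ravel()])    # 4096 + 90 = 4186 dims

# e.g. make_state(np.zeros(4096), past_actions=[0, 3, 8])
```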

Rewards?

Motivation:

  • Reward function R is proportional to the improvement that the agent makes to localize an object after selecting a particular action
  • Measured using the Intersection-over-Union (IoU) between the target object and the predicted box at any given time
$\mathrm{IoU}(b, g) = \dfrac{\mathrm{area}(b \cap g)}{\mathrm{area}(b \cup g)}$

$R_a(s, s') = \mathrm{sign}\big(\mathrm{IoU}(b', g) - \mathrm{IoU}(b, g)\big)$

$R_{\text{trigger}}(s, s') = \begin{cases} +\eta & \text{if } \mathrm{IoU}(b, g) \ge \tau \\ -\eta & \text{otherwise} \end{cases}$

where b is the box of the observable region in state s, b' is the box after taking action a, g is the ground-truth box for the target object, η is the trigger reward magnitude, and τ is the IoU threshold.

Explanation:

  • Given a state s, actions that result in a higher IoU with the ground truth are rewarded; otherwise the actions are penalized.
  • For the trigger action, the reward is positive if the final IoU with the ground truth is greater than a certain threshold, and negative otherwise.
  • This reward scheme is binary, r ∈ {−1, +1}, and applies to any action that transforms the box.
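
A minimal sketch of this reward, with the IoU computed on (x1, y1, x2, y2) boxes; the threshold `tau` and trigger magnitude `eta` values below are assumptions for illustration:

```python
def iou(b, g):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    area_g = (g[2] - g[0]) * (g[3] - g[1])
    union = area_b + area_g - inter
    return inter / union if union > 0 else 0.0

def reward(b, b_next, g, action, tau=0.6, eta=3.0):
    """Binary reward for box transformations, thresholded reward for 'trigger'.
    tau and eta are assumed hyperparameter values."""
    if action == "trigger":
        return eta if iou(b_next, g) >= tau else -eta
    return 1.0 if iou(b_next, g) > iou(b, g) else -1.0
```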

Network Architecture

[Figure: network architecture]

Some Quantitative Results

[Figure: quantitative results]

Some Qualitative Results

[Figure: qualitative results]

Why use RL here?

[Figure: why RL for object detection]

Task 2: Action Detection


What is the task?

[Figure: action detection example]
  • Does an action occur?
  • When does the action occur?

What's the idea here?

  • A class-specific action-detection model.
  • It learns to continually adjust the current region to cover the ground truth more precisely, in a self-adaptive way.
  • This is achieved by applying a sequence of transformations to a temporal window that is initially placed in the video at random and finally finds and covers as much of the action region as possible.

Example Output

[Figure: example temporal action detection output]

How to view it as an RL problem?

Question 1: How to cast this as a Markov Decision Process?

Answer 1: The problem is cast as an MDP, in which the agent interacts with the environment and makes a sequence of decisions to achieve its goal.

  • In this formulation, the environment is the input video clip,
  • in which the agent observes the current video segment, called the temporal window,
  • and through its actions it adjusts the position or span of the window, to achieve the goal of locating the actions precisely.

Identifying MDP Parameters

Action Space?

[Figure: the temporal-window actions]
All the actions are regular actions, except jump.

The idea behind this irregular action is to translate the temporal window to a new position far away from the current one, so that the agent does not trap itself around the present location when there is no motion occurring nearby.
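
A rough sketch of such temporal-window actions (the exact action set and the step factor `ALPHA` are assumptions; only the random relocation of "jump" is taken from the description above):

```python
import random

ALPHA = 0.2  # fraction of the current window span moved/scaled per step (assumed)

def transform_window(window, action, video_length):
    """Apply one action to a temporal window given as (start, end)."""
    s, e = window
    step = ALPHA * (e - s)
    if action == "shift_right":   s, e = s + step, e + step
    elif action == "shift_left":  s, e = s - step, e - step
    elif action == "expand":      s, e = s - step, e + step
    elif action == "shrink":      s, e = s + step, e - step
    elif action == "jump":
        # irregular action: relocate the window to a random distant position
        span = e - s
        s = random.uniform(0, max(video_length - span, 0))
        e = s + span
    # clamp to the video extent and keep the window well-formed
    s = min(max(0.0, s), float(video_length))
    e = min(float(video_length), max(e, s))
    return (s, e)
```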

State Space?

  • State representation = composed of two components (current temporal window, history of actions taken)
  • current temporal window => describes the motion within the current window
    • features extracted from a C3D CNN model (pre-trained on Sports-1M)
    • (fc6 output: a 4,096-dimensional feature vector representing its content)
  • history of actions taken = vector with the history of taken actions
    • The history vector encodes the 10 past actions
    • Each action is encoded as a 7-dimensional binary vector

Motivation behind the history vector: it is useful to stabilize search trajectories that might get stuck in repetitive cycles, improving average precision by approximately 3 percentage points.

Rewards?

Motivation:

  • Reward function R rewards the agent for actions that improve the motion-localization accuracy
  • and penalizes the agent for actions that decrease the accuracy
  • Measured using the Intersection-over-Union (IoU) between the currently attended temporal window and the ground-truth region of motion
$\mathrm{IoU}(w, g) = \dfrac{\mathrm{span}(w \cap g)}{\mathrm{span}(w \cup g)}$

where w is the current temporal window and g is a ground-truth region of motion.
$r(s, s') = \mathrm{sign}\Big(\max_{1 \le k \le n} \mathrm{IoU}(w', g_k) - \max_{1 \le k \le n} \mathrm{IoU}(w, g_k)\Big)$

where w and w' are the attended windows corresponding to states s and s' respectively, and n is the number of ground truths within the video.

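A minimal sketch of this reward, assuming it is the sign of the change in the best temporal IoU over the video's ground-truth segments (windows and ground truths are (start, end) intervals):

```python
def temporal_iou(w, g):
    """IoU between two temporal intervals w = (ws, we) and g = (gs, ge)."""
    inter = max(0.0, min(w[1], g[1]) - max(w[0], g[0]))
    union = (w[1] - w[0]) + (g[1] - g[0]) - inter
    return inter / union if union > 0 else 0.0

def window_reward(w, w_next, groundtruths):
    """+1 if the best overlap with any of the n ground truths improved, else -1."""
    best      = max(temporal_iou(w,      g) for g in groundtruths)
    best_next = max(temporal_iou(w_next, g) for g in groundtruths)
    return 1.0 if best_next > best else -1.0
```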

Network Architecture

[Figure: network architecture]

Some Quantitative Results

[Figure: quantitative results]

Some Qualitative Results

[Figure: qualitative results]

Why use RL here?

[Figure: why RL for action detection]

Task 3: Visual Tracking


The Task at hand:

Single Line:

  • Visual tracking solves the problem of finding the position of the target in a new frame from the current position.

Definition:

  • Visual Tracking is the process of locating, identifying, and determining the dynamic configuration of one or many moving (possibly deformable) objects (or parts of objects) in each frame of one or several cameras

Tracking Challenges

  • Object Modeling: how to define what an object is in terms that can be interpreted by a computer?
  • Appearance Change: the observation of an object changes with many parameters (illumination conditions, occlusions, shape variation, ...)
  • Kinematic Modeling: how to inject priors on object kinematics and on interactions between objects.

Common hurdles

  • Track initiation and termination
  • Occlusion handling
  • Merging/switching
  • Drifting due to wrong update of the target model

Existing Methods

  • Various classical tracking methods (KLT tracker, etc.)
  • Recent CNN-based methods:
    • Learning Multi-Domain Convolutional Neural Networks for Visual Tracking, CVPR'16
    • Siamese Instance Search for Tracking, CVPR'16
  • These still require computationally inefficient search algorithms, such as sliding window or candidate sampling.

Why RL:

  • No post-processing such as bounding-box regression is needed
  • Fewer search steps than sliding window or candidate sampling approaches

Focus of this paper:

Improvements over existing work:

1) It presents a better algorithm than the existing inefficient search strategies, which explore the region of interest and select the best candidate by matching against the tracking model, and

2) it presents a method to train with unlabeled frames in a semi-supervised setting.

MDP Formulation:

  • Visual tracking solves the problem of finding the position of the target in a new frame from the current position.

  • The proposed tracker dynamically pursues the target by sequential actions

  • The tracker is defined as an agent whose goal is to capture the target with a bounding box.

The MDP Parameters:

State Space:

  • Goal: capture the required information.
  • What useful information must the state capture?
  • State is defined by tuple ($p_t,d_t$)
  • $p_t \in R^{112 \times 112 \times 3}$: denotes the image patch within the bounding box
  • $d_t \in R^{110}$: represents the dynamics of actions denoted by a vector containing the previous $k$ actions at $t$-th iteration
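
A minimal sketch of how this state could be assembled, assuming the frame is a NumPy image array and the 110-dimensional dynamics vector encodes the last $k = 10$ actions as one-hot vectors over 11 discrete actions (10 × 11 = 110; see the action sketch below):

```python
import numpy as np

N_ACTIONS, K = 11, 10    # assumed: 11 discrete actions, K = 10 past actions (10 * 11 = 110)

def make_tracker_state(frame, box, past_actions):
    """State s_t = (p_t, d_t): an image patch and a 110-d action-dynamics vector."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    patch = frame[y1:y2, x1:x2]          # crop the current bounding box from the frame
    # resizing the patch to 112x112x3 (e.g. with cv2.resize) is omitted here
    d = np.zeros((K, N_ACTIONS), dtype=np.float32)
    for slot, a in enumerate(past_actions[-K:]):
        d[slot, a] = 1.0                 # one-hot per previous action
    return patch, d.ravel()              # d.ravel() has 10 * 11 = 110 entries
```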

Action Space

  • In this paper: Discrete Actions
  • What can be the actions?
[Figure: the discrete tracker actions]
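
The concrete action set is given in the figure; one plausible choice (assumed here rather than copied from the figure) combines translations, scale changes, and a terminating stop action, which is also consistent with the 110-dimensional dynamics vector above:

```python
# Assumed discrete action set for the tracker (11 actions in total).
TRACKER_ACTIONS = [
    "left", "right", "up", "down",        # unit translations of the bounding box
    "left2", "right2", "up2", "down2",    # double-step translations
    "scale_up", "scale_down",             # enlarge / shrink the box
    "stop",                               # terminate and report the current box
]
assert len(TRACKER_ACTIONS) == 11         # matches d_t in R^110 (= 10 past actions x 11)
```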

Transition Probabilities:

  • After deciding on action $a_t$ in state $s_t$, the next state $s_{t+1}$ is obtained by the state transition function.
  • Will the transition be stochastic?

Reward:

At the termination step $T$, that is, when $a_T$ is the ‘stop’ action, $r(s_T)$ is assigned by:

[Equation figure: terminal reward $r(s_T)$]
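
A hedged sketch of such a terminal reward, assuming it is +1 when the final box overlaps the ground truth above an IoU threshold and −1 otherwise; the threshold value below is an assumption:

```python
def iou(b, g):
    """Box IoU for (x1, y1, x2, y2) boxes (same helper as in the Task 1 sketch)."""
    ix1, iy1 = max(b[0], g[0]), max(b[1], g[1])
    ix2, iy2 = min(b[2], g[2]), min(b[3], g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0

IOU_THRESHOLD = 0.7   # assumed value for the termination threshold

def terminal_reward(final_box, groundtruth_box):
    """Reward assigned only at the termination ('stop') step T."""
    return 1.0 if iou(final_box, groundtruth_box) > IOU_THRESHOLD else -1.0
```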

Network Architecture

[Figure: network architecture]

Action Selection:

[Figure: action selection]

Results

[Figure: results]

Results

[Figure: further results]

Analysis of the actions

  • Most frames require fewer than five actions to pursue the target!
[Figure: number of actions taken per frame]