Decision Theory and Reinforcement Learning

Decision Theory

An agent observes the environment and performs an action that changes the environment
An example: a player can make moves in a game
A game tree defines all the possible actions taken by the agents
Utility defines a score for the leaf nodes of a game tree

Minimax

When the game tree is known, minimax decision rule defines the optimal sequence of actions
Assumes that the other player always chooses the branch with minimum utility
Always chooses the branch with maximum utility

Markov Decision Process

A framework for modeling decision-making problems
Can be solved using Monte Carlo tree search or reinforcement learning
An MDP is composed of states, actions, and transition probabilities $p(s’ \mid s, a)$
Generally, the result of each action is not known in advance (the environment is stochastic)
Markovian transition model: the probability of reaching state $s’$ when performing action $a$ in state $s$ depends only on the current state $s$

Reward

In each state the agent receives reward $R(s)$
Alternatively, we may consider the expected reward $R(s, a, s’)$ after taking the action $a$

Policy

Policy is a mapping from states to actions $π(s)$, a probability of taking an action $π(a \mid s)$, or e.g. a search process
Executing the same policy will generally not end up in the same state

Utility / Return

Future rewards are discounted by a factor of gamma per time step
Utility or return is the cumulative discounted reward $U = R(s_0) + γ R(s_1) + γ^2 R(s_2) + …$ of a sequence of states

Value

It is useful to assess states with respect to the expected utility
The value function $v_π(s)$ is the expected utility at state $s$ by following policy $π$
The state-action value function $q^π(s, a)$ is the expected utility at state $s$ after taking action $a$ and then following policy $π$.

Optimal Policy

Optimal policy is the one that maximizes the expected utility
Bellman equation: utility of a state is the reward from that state plus the expected utility of the next state, assuming optimal policy

$$ U(s) = R(s) + γ \max_a \sum_{s’} P(s’ \mid s, a) U(s’) $$

Monte Carlo Tree Search

Simulates game play (moving down in the search tree) starting from the current game state according to a fast rollout policy (e.g. randomly)
After e.g. a given time has elapsed, evaluates the end state
The node where the simulation started is marked visited
Propagates the evaluation result up to the root of the game tree and update all the nodes along the way
Each node stores statistics on how promising it is (reward) and how well it has been explored (number of visits)
Upper confidence bound (UCD) function decides whether to exploit the promising nodes or explore less visited nodes when choosing how to expand the tree
May also use a policy network to provide prior probabilities for different moves

Reinforcement Learning

A very general framework for phrasing reward-related learning
The agent cannot necessarily observe the whole environment
Feedback in form of rewards
The agent learns the best behaviour by observing the state space
A state can be defined as a summary of previous actions, observations, and rewards

Compared to Supervised Learning

Typically the objective is a discrete metric so a gradient is not available
Decisions are sequential (the next input depends on the current decision)
The agent is not told which action to take
The rewards may be delayed in time

Problematic Areas

Continuous control
Sparse rewards
Long-term correlations

Approaches to Reinforcement Learning

Utility-Based Agents

Learn a utility function on states
Model the environment to know the state to which an action will lead

Action-Value Methods

Learn the optimal state-action value function $q^*(s, a)$ (the maximum $q$ under any policy) and use it to design a good policy
Can be used without a model of the environment (state transition and reward probability functions)
Cannot look ahead when it’s unknown where the action leads
Off-policy agents learn the optimal policy while performing actions of another policy to ensure adequate exploration
On-policy agents execute the same policy that they learn

Policy-Gradient Methods

Optimize a parameterized policy (for example a neural network) by gradient descent
Policy gradient theorem reformulates the loss in such a way that only the policy needs to be differentiable
A value function may be used to learn the policy parameters, but is not required for action selection

Actor-Critic Methods

An actor chooses an action to take and a critic evaluates being in a state
For example a neural network with actor and critic output layers
Policy-gradient is a special case of this more general idea, where the policy is the actor and the reward is used as the critic

Reflex Agents

Learn a mapping from states to actions

Utility Estimation

The agent performs trials
Each trial provides a sample of the expected reward-to-go from the states it visited
Disregard knowledge about the dependency of the utility between states (Bellman equation)

Q-Learning

Q-learning is an off-policy action-value method
The optimal state-action value function $q^*(s, a)$ predicts the maximum (under any strategy) expected utility in state $s$ after action $a$
The policy is to select the action with maximal $q$
$q$ obeys the Bellman equation—can be used to define an iterative update so that $q_i$ will converge to $q^*$:

$$ q_{i+1}(s, a) = R(s) + γ \max_{a'} \sum_{s’} P(s’ \mid s, a) q_i(s’, a’) $$

Can operate in discrete state and action spaces only
Policy is not differentiable—action probabilities change abruptly when a different action has the maximal value

Dynamic Programming Approach to Q-Learning

Think of $q$ as the length of a path segment in a graph
Use dynamic programming to find the shortest path
Requires a model of the environment

Sample-Based Approaches to Q-Learning

Don’t require a model of the environment

Deep Q-Learning

Calculating $q(s, a)$ exactly for every state and action is impractical except in trivial cases.
$q$ can be approximated as a neural network with weights $w$: $q(s, a, w)$
Deep Q-learning minimizes the squared error of the predicted $q$ value and the target, which is the same as the iterative update $q_{i+1}(s, a)$
The target is nonstationary—it changes at each iteration when the weights are updated

Policy-Gradient Methods

Define an explicit policy
The policy gives smooth action probabilities and is differentiable
Don’t (necessarily) learn the value function
Effective in high-dimensional / continuous action spaces
No need for a specific set of rules for the policy
Popular for robotics

REINFORCE

REINFORCE is the first and most well known policy-gradient method
In the context of sequence generation, the policy is a model that is used to generate the most likely sequence
The generated sequence is compared to the optimal (ground truth) sequence to compute the reward
The policy is updated based on the gradient of the loss, which is the negative expected reward
The expected reward is approximated using one or more samples

Seppo Enarvi

Decision Theory

Minimax

Markov Decision Process

Reward

Policy

Utility / Return

Value

Optimal Policy

Monte Carlo Tree Search

Reinforcement Learning

Compared to Supervised Learning

Problematic Areas

Approaches to Reinforcement Learning

Utility-Based Agents

Action-Value Methods

Policy-Gradient Methods

Actor-Critic Methods

Reflex Agents

Utility Estimation

Q-Learning

Dynamic Programming Approach to Q-Learning

Sample-Based Approaches to Q-Learning

Deep Q-Learning

Policy-Gradient Methods

REINFORCE