Decision Theory

  • An agent observes the environment and performs an action that changes the environment
  • An example: a player can make moves in a game
  • A game tree defines all the possible actions taken by the agents
  • Utility defines a score for the leaf nodes of a game tree


  • When the game tree is known, minimax decision rule defines the optimal sequence of actions
  • Assumes that the other player always chooses the branch with minimum utility
  • Always chooses the branch with maximum utility

Markov Decision Process

  • A framework for modeling decision-making problems
  • Can be solved using Monte Carlo tree search or reinforcement learning
  • An MDP is composed of states, actions, and transition probabilities $p(s’ \mid s, a)$
  • Generally, the result of each action is not known in advance (the environment is stochastic)
  • Markovian transition model: the probability of reaching state $s’$ when performing action $a$ in state $s$ depends only on the current state $s$


  • In each state the agent receives reward $R(s)$
  • Alternatively, we may consider the expected reward $R(s, a, s’)$ after taking the action $a$


  • Policy is a mapping from states to actions $π(s)$, a probability of taking an action $π(a \mid s)$, or e.g. a search process
  • Executing the same policy will generally not end up in the same state

Utility / Return

  • Future rewards are discounted by a factor of gamma per time step
  • Utility or return is the cumulative discounted reward $U = R(s_0) + γ R(s_1) + γ^2 R(s_2) + …$ of a sequence of states


  • It is useful to assess states with respect to the expected utility
  • The value function $v_π(s)$ is the expected utility at state $s$ by following policy $π$
  • The state-action value function $q^π(s, a)$ is the expected utility at state $s$ after taking action $a$ and then following policy $π$.

Optimal Policy

  • Optimal policy is the one that maximizes the expected utility
  • Bellman equation: utility of a state is the reward from that state plus the expected utility of the next state, assuming optimal policy
$$ U(s) = R(s) + γ \max_a \sum_{s’} P(s’ \mid s, a) U(s’) $$
  • Simulates game play (moving down in the search tree) starting from the current game state according to a fast rollout policy (e.g. randomly)
  • After e.g. a given time has elapsed, evaluates the end state
  • The node where the simulation started is marked visited
  • Propagates the evaluation result up to the root of the game tree and update all the nodes along the way
  • Each node stores statistics on how promising it is (reward) and how well it has been explored (number of visits)
  • Upper confidence bound (UCD) function decides whether to exploit the promising nodes or explore less visited nodes when choosing how to expand the tree
  • May also use a policy network to provide prior probabilities for different moves

Reinforcement Learning

  • A very general framework for phrasing reward-related learning
  • The agent cannot necessarily observe the whole environment
  • Feedback in form of rewards
  • The agent learns the best behaviour by observing the state space
  • A state can be defined as a summary of previous actions, observations, and rewards

Compared to Supervised Learning

  • Typically the objective is a discrete metric so a gradient is not available
  • Decisions are sequential (the next input depends on the current decision)
  • The agent is not told which action to take
  • The rewards may be delayed in time

Problematic Areas

  • Continuous control
  • Sparse rewards
  • Long-term correlations

Approaches to Reinforcement Learning

Utility-Based Agents

  • Learn a utility function on states
  • Model the environment to know the state to which an action will lead

Action-Value Methods

  • Learn the optimal state-action value function $q^*(s, a)$ (the maximum $q$ under any policy) and use it to design a good policy
  • Can be used without a model of the environment (state transition and reward probability functions)
  • Cannot look ahead when it’s unknown where the action leads
  • Off-policy agents learn the optimal policy while performing actions of another policy to ensure adequate exploration
  • On-policy agents execute the same policy that they learn

Policy-Gradient Methods

  • Optimize a parameterized policy (for example a neural network) by gradient descent
  • Policy gradient theorem reformulates the loss in such a way that only the policy needs to be differentiable
  • A value function may be used to learn the policy parameters, but is not required for action selection

Actor-Critic Methods

  • An actor chooses an action to take and a critic evaluates being in a state
  • For example a neural network with actor and critic output layers
  • Policy-gradient is a special case of this more general idea, where the policy is the actor and the reward is used as the critic

Reflex Agents

  • Learn a mapping from states to actions

Utility Estimation

  • The agent performs trials
  • Each trial provides a sample of the expected reward-to-go from the states it visited
  • Disregard knowledge about the dependency of the utility between states (Bellman equation)


  • Q-learning is an off-policy action-value method
  • The optimal state-action value function $q^*(s, a)$ predicts the maximum (under any strategy) expected utility in state $s$ after action $a$
  • The policy is to select the action with maximal $q$
  • $q$ obeys the Bellman equation—can be used to define an iterative update so that $q_i$ will converge to $q^*$:
$$ q_{i+1}(s, a) = R(s) + γ \max_{a'} \sum_{s’} P(s’ \mid s, a) q_i(s’, a’) $$
  • Can operate in discrete state and action spaces only
  • Policy is not differentiable—action probabilities change abruptly when a different action has the maximal value

Dynamic Programming Approach to Q-Learning

  • Think of $q$ as the length of a path segment in a graph
  • Use dynamic programming to find the shortest path
  • Requires a model of the environment

Sample-Based Approaches to Q-Learning

  • Don’t require a model of the environment

Deep Q-Learning

  • Calculating $q(s, a)$ exactly for every state and action is impractical except in trivial cases.
  • $q$ can be approximated as a neural network with weights $w$: $q(s, a, w)$
  • Deep Q-learning minimizes the squared error of the predicted $q$ value and the target, which is the same as the iterative update $q_{i+1}(s, a)$
  • The target is nonstationary—it changes at each iteration when the weights are updated

Policy-Gradient Methods

  • Define an explicit policy
  • The policy gives smooth action probabilities and is differentiable
  • Don’t (necessarily) learn the value function
  • Effective in high-dimensional / continuous action spaces
  • No need for a specific set of rules for the policy
  • Popular for robotics


  • REINFORCE is the first and most well known policy-gradient method
  • In the context of sequence generation, the policy is a model that is used to generate the most likely sequence
  • The generated sequence is compared to the optimal (ground truth) sequence to compute the reward
  • The policy is updated based on the gradient of the loss, which is the negative expected reward
  • The expected reward is approximated using one or more samples