Everyone and their grandmothers have heard about the success of deep learning on challenging tasks like beating humans at the game of Go or on Atari Games ☺. The key underlying principle for the success of course is using reinforcement learning. But, what is the mathematical principle behind this game? The key insight necessary for understanding how we make decisions under uncertainty is based on the principle of a Markov Decision Process or an MDP in short. In this article we aim to understand MDPs

Let us first start by playing a game!

Consider the game is as often the case associated with the roll of a dice.When the game starts you have the following situation:

- You can choose to immediately quit the game and you get paid ￡8

The Dice Game

- You can choose to continue the game. If you do so, then a dice will be rolled
- If the dice shows 1 or 2 then you go to the end of the game and you are paid ￡4
- If any other number shows then you are paid ￡4 and you go back to the start of the game
- You need to make a single decision as to what your policy will be when you are at the start of the game and that is applied always

What would you choose, would you choose to stay in the game or to quit? If you decide to stay in the game, then what would your expected earnings be? Would you earn more than ￡8. We can answer all these queries using the notion of an MDP. And let me assure you, it is useful for many more things other than frivolous unlikely dice games ☺

Let us visualize the different states that are possible for this simple game in the figure below

MDP states for the Dice Game

What we have illustrated is the Markov Decision Process view for the dice game. It shows the different states (In, End), the different actions (stay, quit) and the different rewards (8,4) as well as the transition probabilities (2/3, 1/3 and 1) for the different actions. In general the Markov decision process captures the states and actions for an agent and the fact that these actions are affected by the environment and can stochastically result in some new state. This is illustrated in the following figure.

Markov Decision Process (Figure from Wikipedia based on the figure in Sutton and Barto’s book)

An MDP consists of

•a set of states (S_t ,S_t’ )

•A set of actions from state S_t such as A_t

•Transition probability P(S_t’ |S_t ,A_t)

•Reward for the transition R(S_t’ ,S_t ,A_t)

- A discount factor γ

# Transition Probabilities

One of the interesting aspects that governs an MDP is the specification of the transition probabilities

•The transition probability P(S_t’ |S_t ,A_t) specifies the probability of ending in state S_t’ from state S_t given a particular action A_t

•For a given state and action the transition probabilities should sum to 1

- For example: P(End |In ,Stay) = 1/3 and P( In|In,Stay) = 2/3
- P(End|In,Quit) = 1

Once we have specified the MDP, our aim is to obtain good policies to obtain the best value. After all, we want to maximize our earnings on dice games☺

# Policy

Let us first define precisely what we mean by a policy.

- A policy π is a mapping from a state S_t to an action A_t
- When we adopt a policy, we follow a random path depending on the transition probabilities (as specified above).
- The utility of a policy is the (discounted) sum of the rewards on the path.

For instance, the following table provides examples of the various paths and the utilities that we obtain by following the policy to choose the action “stay” when we are at the “in” node.

Possible paths and the utilities obtained by the policy to stay in the game

We would like to optimize and obtain a policy that maximizes our potential to obtain high utility. However, clearly, we cannot optimize on the utility of any particular path itself as it is a random variable. What we do optimize is on the “expected utility”. While the utility of a particular random path cannot be optimized on, we can actually optimize on the expected utility.

*The value of a policy is the expected utility. We choose to obtain the best policy by optimizing this quantity*

# Discounting

When we specified the MDP, we mentioned that one of the parameters is the discount factor. Let us clarify what we mean by that now. We have clarified the utility for a policy. Now we can account for the discount factor.

•The utility with discount factor γ is u=r_1+ γr_2+γ² r_3+γ³ r4+⋯

•Discount of γ = 1 implies that a future reward has the same value as a present reward

•Discount of γ = 0 implies that a future reward has no value

- Discount of 0<γ<1 implies a discount on the future indicated by the value of γ

# Value of states vs actions

Value of a state

The value of a state (v_π (s) ) depends on the values of the actions possible and how likely each action is to be taken under a current policy π (for e.g. V(in) = choice of Q(in,stay) or Q(in,quit)

Q-value — Value of a state-action pair

The value of an action — termed Q value (q_π (s,a) ) depends on the expected next reward and the expected sum of the remaining rewards. The differences between the two types of value functions will be clarified further once we consider an example.

# Q-Value evaluation

We first start by understanding Q-value. Let us first consider the case that we have been given a specific policy π. In that case, the policy of the state in is easily obtained as

Now we can obtain the expression for the Q-value as

The expected next reward is calculated for each of the next states that is possible. The reward is obtained as the transition probability to go to the particular state and the reward on going to that next state. Additionally we obtained the discounted value of the next state as providing us with the remaining expected rewards by reaching that next state. This is illustrated in the figure below

# Q-Value evaluation Example

Let us evaluate the Q-value for the policy where we choose the action ‘stay’ when we are at the ‘in’ state

The state diagram for the dice game

When we reach the end state, the value is 0 as we are already at the end state and no further rewards are obtained. Thus V_π (end)=0

For the other cases, when we are not at the end state, the value is obtained as

Value for the specific ‘in’ case

The values 1/3 and 2/3 are provided by the transition probabilities. The reward for reaching ‘end’ state or ‘in’ state is 4. We then obtain the expected utility of the subsequent state i.e. the ‘end’ state or ‘in’ state. From this we obtain

Calculation for V(in)

Thus the expected value to ‘stay in’ the game results in a value of 12. This is greater than the value to quit and so the optimal policy would be stay in the game

# Optimal value and policy

So far we have assumed that we are provided a particular policy. Our goal is to obtain the maximum expected utility for the game in general. We can do so by finding the optimal value V_opt(S) which is the maximum value attained by any policy. *How do we find this?*

We can do so by a simple modification to our policy evaluation step. For a fixed policy, we calculated the value as

Now, this becomes

Optimal policy evaluation

The corresponding Q-value is obtained as

This is very similar to our earlier evaluation of the Q-value. The main difference is that we incorporate the optimal value for the future states s’

# Optimal Value and Policy Example

Let us now consider the dice game. If we are not in the end state, then we have two options for the actions, i.e. either to stay in or to quit

The optimal policy would be calculated as

V_opt = max(Q(in,stay),Q(in,quit))

Q(in,quit) = 1*8 + 0 as the transition probability to go from in to end is 1 if we decide to quit and the reward is 8. Thus Q(in,quit) = 8

Q(in,stay) = 12 as calculated previously, i.e.

Thus V_opt = max(Q(in,stay),Q(in,quit)) = max(12,8) = 12 and the chosen action would be to stay in the game

We have so far easily calculated the value through a recursive solution. In some cases, the policy evaluation may not be possible in a closed form as there may be many states and transitions. We then opt for an iterative policy using Bellman’s iterative policy evaluation as one of the possible options.

To conclude, we considered the task of understanding a *Markov Decision Process *and have considered this in detail using an example. A very good resource to understand this topic further is the lecture on this topic by Dorsa Sadigh in CS 221 at Stanford over here. The dice game example is based on this lecture. Another excellent reference for understanding this topic in detail is the book on Reinforcement Learning by Sutton and Barto. Note that the book is available for free with the accompanying code.