Ariel University

Course number: 7061510-1

Lecturer: Dr. Amos Azaria

Edited by: Moshe Hanukoglu

Date: First Semester 2018-2019

This method is called the carrot and stick method.

While learning the agents, we will do the same thing, meaning that when the agent performs a good action, he will be given a positive feedback otherwise given negative feedback.

Important thing to emphasize: the reward depends not only on the action, but also on the state of the environment!

One of the cases that coaches agents by feedback is in games where the agent plays the game alone.

We'll show you an environment that allows you to connect very simple games to the Python code. In this environment you can build an agent and train him on the game.

First, you need to install the package by typing the following command in bash

```
pip install gym
```

We'll see an example of a very simple game called FrozenLake-v0

The rules of the game are:

Start with the letter "S"

You have to reach the letter G

At any stage you can move within the field to the right, left, up and down.

When you walk on the letter F move randomly (may be different from the direction you wanted)

When you reach the letter H, you are disqualified and you can not move anymore.
The steps are:

- 0 - left
- 1 - Down
- 2 - Right
- 3 - Up

In [1]:

```
import gym
import sysenv = gym.make('FrozenLake-v0')
env.reset()
env.render()
move = input()
while move.isdigit():
dig_move = int(move)
if (dig_move < env.action_space.n):
env.step(dig_move)
env.render()
move = input()
```

In order for us to analyze the game and we can teach the agent to play the game, a mathematical model can be used to simulate the game.

The MDP model is a randomly controlled time-controlled process or part of the process is random and user-controllable.

The model contains four parts (S, A, R, T):

- S (State) - Possible situations of the world. Some states can have termination modes. In the FrozenLake game, there are 4x4 = 16 modes (the board does not change its position, only the player moves on the board and can be in any of the 16 squares).There are 5 final modes when the player stands on H or G.
- A (Actions) - The actions that the agent can do, in the FrozenLake game, have 4 actions left, right, up and down.
- R (Reward function) - Will be marked in R (s, a) where s is any state and a is an action done from the same states. In the FrozenLake game, if the agent is in a certain position and after the action has reached H mode then the reward for this action is 0 and if it reaches G mode then the reward for this action will be 1. Future rewards are assumed to be discounted with a discount factor ($\gamma$). When you want to check the reward of a particular action, then you also include the following rewards from this action, but the reward of a future action is worth less at this point. So we add the future reward twice $\gamma$ to reduce the future value. For example, if $\gamma$=0.7, and the agent receives a reward of 4 this turn, 7 next turn, and 32 the following turn. The total value for the agent will be 4 + $\gamma$(7 + $\gamma$32) = 4 + 0.7(7 + 22.4) = 24.58
- T (Transition function (T(s,a,s')) - Sometimes it will be marked with P. The probability of getting from s mode with action a to s' mode.

To expand the subject link

We will focus on Model-Free MDP ie, we do not know anything about Transition function, or the reward function. But we know all the states and possible actions.

We will define some concepts for understanding the sequel:

- $\large{Q_\pi values}$ - The expected value of the reward from this action until the end of the game according to the agent's operating policy. For example, if we are in a state before the last and you are moving to a final state and the reward for this action is 2, then we get $Q_\pi values(s,\ a) = 2$. By this definition, if we have all the values of $Q_{\pi^*}value$ (which are obtained by the optimal policy) and we are in state s then we will select the action that has the maximum $Q_\pi value$, that is, we perform the following formula: $ arg\ max_{a' \in A} Q(s,\ a') $

There are two approaches to selecting the action in each state and learning Qvalue, Exploration and Exploitation values

- $\large{Exploration}$ - We will perform a random action and check the value of this action and save the value.
- $\large{Exploitation}$ - We will take action based on what information we have learned so far.

We will combine the methods, that is, we will perform a few random actions and a lot of actions based on previous things, in this way also learn new situations and learn to work correctly and progress in the real direction. And the approach that will use only one of the methods is incorrect because learning only randomly will not reach other world situations, but only rarely (it is very difficult to finish a game in a random way). And learning only on the basis of previous information will not introduce new moves. Therefore, a small value ($\epsilon$) is set so that we will perform random action only if a value is smaller than $\epsilon$ otherwise we will perform the intelligent operation. In the probability $\epsilon$ we will execute random action and probability 1 - $\epsilon$ will perform an action based on previous information.

- $\large{Q Learning}$ - Suppose that we in state s perform action a and reach s' state and get reward r (Assume that T(s, a, s') = 1) and assume that we have all the optimal $Q_{\pi^*}values$ of all possible states and actions except the $Q value$ of our state with the action we want to perform.
Then to find the missing $Q value$ we will use the following formula: $Q(s,\ a)\ =\ r\ +\ \gamma max_{a'}Q(s',\ a')$.

When we want to learn how to play we will use the formula above. We will have a value of $Q(s,\ a)$ but the value $r\ +\ \gamma max_{a'}Q(s',\ a')$ will be a little more accurate and therefore we want $Q(s,\ a)$ to be slightly affected $r\ +\ \gamma max_{a'}Q(s',\ a')$ So we'll use the formula $Q(s_t,\ a_t) = \underbrace{Q(s_t,\ a_t)}_{old\ value} + \underbrace{\alpha}_{learning\ rate}\left(\overbrace{\underbrace{r_{t+1}}_{reward} + \underbrace{\gamma}_{discount\ factor} * \underbrace{\underset{a}{max} Q(s_{t+1},\ a)}_{estimate\ of\ optimal\ future\ value}}^{learned\ value} - \underbrace{Q(s_t,\ a_t)}_{old\ value}\right).$ This is what we get if we perform gradient descent, and define the loss as $\left(\left(r + γmax_{a'}Q\left(s',a'\right)\right) – Q(s,a)\right)^2$, while referring to Q(s,a) as a weight (Wsa), and Q(s',a') as a constant.

The formula above is one of the two approaches to learn the $Q(s_t,\ a_t)$, it calls $\large{OFF-Policy}$, in this approach when calculating the $Q value$ of state t we assume that in the next state we will perform the best action but in fact, there is a probability that in the next state we will get a number less than $\epsilon$ and we choose a random action, not necessarily the best action.

In $\large{ON-Policy\ (SARSA)}$, We choose that action of next state and then we calculate the $Q value$ that is when we calculate the $Q value$ we know the action of next state. The formula of this approach is: $Q(s_t,\ a_t) = \underbrace{Q(s_t,\ a_t)}_{old\ value} + \underbrace{\alpha}_{learning\ rate}\left(\overbrace{\underbrace{r_{t+1}}_{reward} + \underbrace{\gamma}_{discount\ factor} * \underbrace{Q(s_{t+1},\ a_{t+1})}_{estimate\ of\ optimal\ future\ value}}^{learned\ value} - \underbrace{Q(s_t,\ a_t)}_{old\ value}\right).$

In [2]:

```
import gym
import sys
import random
import numpy as npepsilon = 0.1
gamma = 0.95
alpha = 0.1env = gym.make('FrozenLake-v0')
env.reset()
env.render()# Q is a table of states and all actions per state
Q = np.zeros ([env.observation_space.n,env.action_space.n])
for i in range(10000): #number of games
s = env.reset()
done = False
while not done: #until game over, or max number of steps
if random.random() < epsilon:
a = env.action_space.sample()
else:
a = np.argmax(Q[s,:])
# Values are returned: the next state, reward and whether the game is over
s_n, r, done, _ = env.step(a)
Q[s,a] = Q[s,a] + alpha*(r + gamma * np.max(Q[s_n,:]) - Q[s,a])
s = s_n
```

In [3]:

```
Q
```

Out[3]:

The method presented above, which produces a table representing all states and all actions for each state, is good when there is a small number of states, but what happens if there are lots of states?

The answer is that we will try to predict the Q-values by a network that will receive state and action and return the Q-value, and we will select the action that returns the maximum Q-value. In this way we will not have to remember all the values but predict them when we need them.

But where is data for this network train?

We perform an action (that maximizes the Q-value) and get the reward (r), the new state ($s_n$), after this we calculate the Q-values of the new state (using the same network), for each of the actions.

Now the "real" Q-value, for the previous state-action, is: $r + γ * max_{a'}Q(s_n,\ a')$$.

In [ ]:

```
import gym
import time
import random
import numpy as np
import tensorflow as tfepsilon = 0.1
gamma = 0.999
num_of_games = 10env = gym.make('Pong-v0')# height:210, width: 160, Since the image is RGB then we get another 3 dimensions.
state = tf.placeholder(tf.float32, shape=[1, 210, 160, 3])
state_vec = tf.reshape(state, [1, 210*160*3])# env.action_space.n returns the number of action per state
W1 = tf.Variable(tf.truncated_normal([210*160*3,env.action_space.n], stddev=1e-5))
b1 = tf.Variable(tf.constant(0.0, shape=[env.action_space.n]))Q4actions = tf.matmul(state_vec,W1) + b1# Q_n, will denote the maximum Q (among all actions).
Q_n = tf.placeholder(tf.float32, shape=[1, env.action_space.n])loss = tf.pow(Q4actions - Q_n, 2)
update = tf.train.GradientDescentOptimizer(1e-10).minimize(loss)sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(num_of_games):
env.reset()
s, _, done, _ = env.step(0) #first move doesn't matter anyway, just get the state
while not done:
all_Qs = sess.run(Q4actions, feed_dict={state:[s]})
if random.random() < epsilon:
next_action = env.action_space.sample()
else:
next_action = np.argmax(all_Qs)
s_n, r, done, _ = env.step(next_action)
Q_corrected = np.copy(all_Qs)
next_Q = sess.run(Q4actions, feed_dict={state:[s_n]})
Q_corrected[0][next_action] = r + gamma * np.max(next_Q)
sess.run(update, feed_dict={state:[s], Q_n: Q_corrected})
s = s_n #move to the next state
```

When we use linear regression we don't get good results, therefore, we will use fully connected layers and CNN.

We can improve the results by using:

- Start playing randomly for the first X games.
- Probability ($\epsilon$) for random move decreases over time.

Until now we use the algorithm that computes the Q value base on a given game policy that is we estimate the value of state s with each action (base on our game policy) and choose the action that returns the maximum of Q value. In this method, the algorithm tries to found the most suitable Q value of each state.

The policy-based algorithms learning the policy of the game $\pi_\theta$ in each iteration that is, the algorithm re-define the policy at each step and computes the value according to this new policy until the policy converges.

In the policy-based algorithm wants to know if the policy is better than other policy, there for it compares the expected value of this policy with the previous policies, but in order to calculate the expected value the algorithm has to know the value of each action that it takes.
There are two solutions to this problem:

- The algorithm reaches the end state, it recursively calculates the values and updates the learning weights, that reflect the quality of the actions.

The algorithm will look like this:

$function\ REINFORCE$

$Initialise\ \theta\ weights$

$for\ each\ episode\ \{(s_1,\ a_1,\ r_2),\ (s_2,\ a_2,\ r_3),...\ ,(s_{T-1},\ a_{T-1},\ r_{T})\}$ ~ $\pi_\theta\ do$

$calculate\ all\ v_i\ values\ (1\le i\le T,Similar\ to\ Qvalue)$

$for\ t\ =\ 1\ to\ T-1\ do$

$\theta\leftarrow\theta+\alpha\nabla_{\theta}log\pi_\theta(s_t,\ a_t)v_t\ (Update\ the\ weights\ to\ maximize\ the\ expected\ value)$

$return\ \theta$ - Similar to method 1, but calculates the $v_i$ values during the game by MLP.

The algorithm will look like this:

$function\ Actor-Critic$

$Initialise\ \theta\ weights$

$Sample\ action\ a\ $ ~ $\ \pi_\theta$

$for\ each\ step\ do $

$sample\ reward\ r=R(s,a),\ sample\ next\ state\ s',\ sample\ action\ a'\ $~$\pi_\theta(s',a')$

$\delta = r+\gamma Q_w(s',a')-Q_w(s,a)$

$\theta = \theta + \alpha \nabla_{\theta}log\pi_\theta(s,a)Q_w(s,a)$

$w = w + \beta\delta\phi(s,a)$

$a = a'$

$s = s'$

$return\ w,\ \theta$