Ariel University

Course name: Deep learning and natural language processing

Course number: 7061510-1

Lecturer: Dr. Amos Azaria

Edited by: Moshe Hanukoglu

Date: First Semester 2018-2019

Recurrent Neural Networks (RNN)

Introduction

Analysis of sentences is needed in many fields, such as translating from one language to another, writing descriptions of pictures, analyzing sound, etc. To perform such tasks properly, and to be able to understand and generate new sentences, we must understand the content of the entire text and not just check which words appear in it.
In this tutorial we will focus on the analysis of text by Recurrent Neural Networks (RNN).
On the face of it, we can ask: why not use a fully-connected network to analyze the sentence? To use a fully-connected network, we would need to convert the sentence into a form that can be fed into the network.
One option is a bag of words. This is not a good method, because it gives no importance to the order of the words within the sentence, and word order is critical for properly understanding the meaning of a sentence.

For example:

  1. The child ate pizza and did not eat vegetables
  2. The child ate vegetables and did not eat pizza

These two sentences have opposite meanings, but if we apply the bag-of-words method to both of them we get exactly the same representation.
Therefore we cannot use this method, as the sketch below demonstrates.
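
To make this concrete, here is a minimal sketch (plain Python, not part of the original notes) showing that the two example sentences produce exactly the same bag of words:

from collections import Counter

s1 = "The child ate pizza and did not eat vegetables"
s2 = "The child ate vegetables and did not eat pizza"

bow1 = Counter(s1.lower().split())  # word -> number of occurrences; the order is discarded
bow2 = Counter(s2.lower().split())

print(bow1 == bow2)  # True: both sentences get the same representation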

Using the one-hot method to represent the words does give importance to the location of the words in the sentence, but it is still unsatisfactory, because it gives no importance to the meaning of the words; all words receive the same weight.

Another option is to use a CNN, since a CNN gives importance to points of information that are close to each other. However, we still do not obtain an overall view of the entire sentence.

Therefore, as noted above, we will learn a new method for analyzing text called RNN.

As mentioned above, text analysis arises in many settings. We will detail these cases:

  1. One to one - The input is a single item (a single image or word) that is classified into a single class (binary classification). For example: is this a house or not?
  2. One to many - The input is a single item (an image) that is mapped to many outputs, for example a caption (a sequence of words) describing the image. For example, for an image of two birds [image credit] we would want the output to be: "Two birds fly."
  3. Many to one - The input is a sequence (of images or words) that is classified into a single class (binary classification). For example: does the video (a sequence of frames) show a birthday party or not? Is the sentence urgent or not?

  4. Many to many - The input is a sequence (of images or words) that is converted into another sequence. For example:

    • Translation of sentences. The entire text is first read (encoded) and only then is the translation produced as output.
    • Automatic speech recognition. Speech is transcribed into text as it arrives: a word is heard and transcribed, then the next word is heard and transcribed, and so on.

^ Contents

Description of the network structure

The RNN structure is as follows: in each iteration the network receives the next character/word $(x_i)$ together with the memory of everything read so far. The output $(h_i)$ is the next character/word, or the translation/meaning of the text so far; this output also serves as input to the next iteration.
Each iteration is drawn as a square, with its computation written inside it.
The two structures in the figure are equivalent; the one on the right is simply the unrolled form of the one on the left.

[Figure: an RNN and its unrolled form (image credit)]
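
To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step (the dimensions and weight matrices are illustrative assumptions, not taken from the course material):

import numpy as np

k, d = 8, 4                           # illustrative input and memory dimensions
W_x = np.random.randn(d, k) * 0.1     # weights applied to the new input x_i
W_h = np.random.randn(d, d) * 0.1     # weights applied to the previous output h_{i-1}
b = np.zeros(d)

def rnn_step(x_i, h_prev):
    # one iteration: combine the new input with the memory of everything read so far
    return np.tanh(W_x @ x_i + W_h @ h_prev + b)

h = np.zeros(d)                       # initial (empty) memory
for x_i in np.random.randn(5, k):     # a toy "sentence" of 5 input vectors
    h = rnn_step(x_i, h)              # each output becomes part of the next iteration's input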

Having seen the general structure across iterations, we will now focus on the internal structure of each iteration.

RNN has several different implementations

  • Long Short Term Memory (LSTM)
  • GRU

We will explain the implementation of each method.

^ Contents

Long Short Term Memory (LSTM)

The full internal structure of each iteration looks as follows:

[Figure: LSTM cell internal structure (image credit)]

Now we will explain each and every part of the above structure:

  • The red part (the forget gate): decides how much of what has been learned so far to forget. That is, the "memory" is multiplied by a number between 0 and 1, where 0 means forgetting everything learned so far, 1 means remembering everything learned so far, and any value in between means keeping the corresponding fraction of what has been learned. The symbol $\sigma$ denotes a sigmoid (logistic regression) unit. Its input is the previous memory and the new word (the one currently being read), and its output is a number between 0 and 1, as described above.
  • The green part (the input gate): decides whether to remember the newly learned information (in addition to the old memory) and, if so, how much of it to remember, i.e. what importance to give to the new information. In other words, this part is responsible for adding to the long-term memory. It consists of two parts:
    1. A sigmoid (logistic regression) unit that determines whether to remember the new information. Its input is the old memory and the new word, and its output is a number between 0 and 1 that says how much to remember.
    2. A tanh unit that creates the new candidate memory, which is then added to the old memory.
  • The blue part (the output gate): decides how much of all the memory created at this stage should be passed on. The memory is passed through a tanh function to obtain numbers between -1 and 1, and the result becomes the input of the next step. In other words, this part is responsible for the short-term memory, the memory that goes on to the next stage. A sketch of these three parts as equations is given right after this list.
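
The following is a minimal NumPy sketch of these three parts in a single LSTM step (the standard LSTM equations; the weight matrices and dimensions are illustrative assumptions, not taken from the course material):

import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

k, d = 8, 4                                      # illustrative input and memory dimensions
Wf, Wi, Wc, Wo = (np.random.randn(d, k + d) * 0.1 for _ in range(4))
bf, bi, bc, bo = (np.zeros(d) for _ in range(4))

def lstm_step(x_i, h_prev, c_prev):
    z = np.concatenate([x_i, h_prev])            # the new word together with the previous output
    f = sigmoid(Wf @ z + bf)                     # red part: how much of the old memory to keep
    i = sigmoid(Wi @ z + bi)                     # green part (1): how much of the new information to add
    c_new = np.tanh(Wc @ z + bc)                 # green part (2): the new candidate memory
    c = f * c_prev + i * c_new                   # updated long-term memory
    o = sigmoid(Wo @ z + bo)                     # blue part: how much of the memory to pass on
    h = o * np.tanh(c)                           # short-term memory sent to the next iteration
    return h, c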

To deepen your understanding of the subject, you can read this blog, which explains the entire structure very well.

^ Contents

How Many Weights In LSTM?

An important part of understanding the network structure is knowing how many weights it contains. To do this, we will analyze the number of weights in the network.
Let us define a few quantities:

  • k - The dimension of the new input $(x_i)$
  • d - the dimension of the memory transferred from one stage to another (the width of the horizontal lines)
  • t - the number of time steps / cells
  • m - Mini-batch size

In this implementation we compute three sigmoid (logistic regression) functions and one tanh; each of these functions is computed d times (once for each memory cell). Each of these functions receives as inputs $x_i$ (k weights) and the memory from the previous step (d weights), plus a bias term, i.e. $(k + d) + 1$ weights per cell, for each of the 4 functions. So the total count is: $\left(\left(k + d\right) + 1 \right) * 4 * d$
It can be seen that m and t do not affect the number at all.
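
As a sanity check, here is a small sketch of this count (the concrete numbers below use the dimensions of the text-generation example later in these notes: 84 possible characters as input and 30 memory cells):

def lstm_weight_count(k, d):
    # 4 functions (3 sigmoids and one tanh), each with k + d inputs plus a bias, for each of the d cells
    return ((k + d) + 1) * 4 * d

print(lstm_weight_count(k=84, d=30))  # 13800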

A different method for realizing an RNN is called GRU. Its realization is shown in the image below. This method is similar in its parts to the previous method (LSTM), but merges several of the parts together.

[Figure: GRU cell structure (image credit)]
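
As a rough illustration of how GRU merges these parts, here is a NumPy sketch of the standard GRU step (not taken from the course material; the dimensions are illustrative):

import numpy as np

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

k, d = 8, 4                                                   # illustrative input and memory dimensions
Wz, Wr, Wh = (np.random.randn(d, k + d) * 0.1 for _ in range(3))

def gru_step(x_i, h_prev):
    z = sigmoid(Wz @ np.concatenate([x_i, h_prev]))           # update gate: merges the forget and input decisions
    r = sigmoid(Wr @ np.concatenate([x_i, h_prev]))           # reset gate: how much of the old memory to use
    h_new = np.tanh(Wh @ np.concatenate([x_i, r * h_prev]))   # candidate new memory
    return (1 - z) * h_prev + z * h_new                       # single merged memory/output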

^ Contents

Generating Text in TensorFlow With RNN

In [1]:
import tensorflow as tf
import numpy as np
import urllib.request
In [2]:
#A list that holds all characters as ASCII values
all_text = [ord(x) for x in list(urllib.request.urlopen("https://s3.amazonaws.com/text-datasets/nietzsche.txt").read().decode("utf-8"))]
In [3]:
batch_size = 15
num_past_characters = 10 # The number of previous characters that predict the next character
step = 1
In [4]:
# A list that holds "representative" for each ASCII value in the text, 
# that is, if there are ASCII values in the text: 65,65,81,70,81 then this list will hold 65,81,70
char_mapping = list(set(all_text))
max_char = max(char_mapping)
In [5]:
possible_chars = len(char_mapping) #The number of different possible text characters
In [6]:
inverse_char_map = np.zeros(max_char+1, dtype=int)
for i in range(possible_chars):
    inverse_char_map[char_mapping[i]] = i
In [7]:
num_input_examples = int((len(all_text)-(num_past_characters+1))/step)
curr_batch = 0

Create the training data.
data_x contains $|batch\_size|$ rows; each row holds $|num\_past\_characters|$ characters, and each character is represented with the one-hot method, i.e. by an array of size $|possible\_chars|$.
data_y contains the character that each of these sequences should predict. Each sequence has one target character represented with the one-hot method, so data_y has $|batch\_size|$ rows and $|possible\_chars|$ columns.

In [8]:
def next_batch():
    global curr_batch
    curr_range_start = curr_batch*batch_size
    if curr_range_start + batch_size >= num_input_examples:
        return (None, None)
    data_x = np.zeros(shape=(batch_size, num_past_characters, possible_chars))
    data_y = np.zeros(shape=(batch_size, possible_chars))
    for ep in range(batch_size):
        for inc in range(num_past_characters):
            data_x[ep][inc][inverse_char_map[all_text[curr_range_start+ep*step+inc]]] = 1
        data_y[ep][inverse_char_map[all_text[curr_range_start+ep*step+num_past_characters]]] = 1
    curr_batch += 1
    return (data_x, data_y)

def back_to_text(to_convert): # gets a one-hot encoded matrix and returns text. We'll need this at the end
    return [chr(char_mapping[np.argmax(letter)]) for letter in to_convert]
In [9]:
a = next_batch()
In [10]:
a[0].shape
Out[10]:
(15, 10, 84)
In [11]:
a[1].shape
Out[11]:
(15, 84)
In [12]:
np.max(a[0])
Out[12]:
1.0
In [13]:
np.max(a[0],2)
Out[13]:
array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [14]:
np.argmax(a[0],2)
Out[14]:
array([[42, 44, 31, 32, 27, 29, 31,  1,  1,  1],
       [44, 31, 32, 27, 29, 31,  1,  1,  1, 45],
       [31, 32, 27, 29, 31,  1,  1,  1, 45, 47],
       [32, 27, 29, 31,  1,  1,  1, 45, 47, 42],
       [27, 29, 31,  1,  1,  1, 45, 47, 42, 42],
       [29, 31,  1,  1,  1, 45, 47, 42, 42, 41],
       [31,  1,  1,  1, 45, 47, 42, 42, 41, 45],
       [ 1,  1,  1, 45, 47, 42, 42, 41, 45, 35],
       [ 1,  1, 45, 47, 42, 42, 41, 45, 35, 40],
       [ 1, 45, 47, 42, 42, 41, 45, 35, 40, 33],
       [45, 47, 42, 42, 41, 45, 35, 40, 33,  4],
       [47, 42, 42, 41, 45, 35, 40, 33,  4, 75],
       [42, 42, 41, 45, 35, 40, 33,  4, 75, 63],
       [42, 41, 45, 35, 40, 33,  4, 75, 63, 56],
       [41, 45, 35, 40, 33,  4, 75, 63, 56, 75]])
In [15]:
cellsize = 30
x = tf.placeholder(tf.float32, [None, num_past_characters, possible_chars])
y = tf.placeholder(tf.float32, [None, possible_chars])

lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(cellsize, forget_bias=0.0)
output, _ = tf.nn.dynamic_rnn(lstm_cell, x, dtype=tf.float32)
output = tf.transpose(output, [1, 0, 2])
last = output[-1]  # the LSTM output after reading the last of the past characters

W = tf.Variable(tf.truncated_normal([cellsize, possible_chars], stddev=0.1))
b = tf.Variable(tf.constant(0.1, shape=[possible_chars]))
z = tf.matmul(last, W) + b
res = tf.nn.softmax(z)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y * tf.log(res), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy)
In [16]:
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(res,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

num_of_epochs = 200
for epoch in range(num_of_epochs):
    acc = 0
    curr_batch = 0
    while True:
        batch_xs, batch_ys = next_batch()
        if batch_xs is None:
            break
        else:
            sess.run(train_step, feed_dict={x: batch_xs, y: batch_ys})
            acc += accuracy.eval(feed_dict={x: batch_xs, y: batch_ys})
    print("step %d, training accuracy %g" % (epoch, acc/curr_batch))
step 0, training accuracy 0.37731
step 1, training accuracy 0.462628
step 2, training accuracy 0.488142
step 3, training accuracy 0.502892
step 4, training accuracy 0.511578
step 5, training accuracy 0.517328
step 6, training accuracy 0.522003
step 7, training accuracy 0.526037
step 8, training accuracy 0.529447
step 9, training accuracy 0.532766
step 10, training accuracy 0.53553
step 11, training accuracy 0.537953
step 12, training accuracy 0.539881
step 13, training accuracy 0.541593
step 14, training accuracy 0.54284
step 15, training accuracy 0.544211
step 16, training accuracy 0.545166
step 17, training accuracy 0.546002
step 18, training accuracy 0.547057
step 19, training accuracy 0.547824
step 20, training accuracy 0.548392
step 21, training accuracy 0.548999
step 22, training accuracy 0.549583
step 23, training accuracy 0.550136
step 24, training accuracy 0.55063
step 25, training accuracy 0.551269
step 26, training accuracy 0.551648
step 27, training accuracy 0.552025
step 28, training accuracy 0.552595
step 29, training accuracy 0.552988
step 30, training accuracy 0.553251
step 31, training accuracy 0.553356
step 32, training accuracy 0.553636
step 33, training accuracy 0.553792
step 34, training accuracy 0.554205
step 35, training accuracy 0.554285
step 36, training accuracy 0.554439
step 37, training accuracy 0.554719
step 38, training accuracy 0.554885
step 39, training accuracy 0.555198
step 40, training accuracy 0.555388
step 41, training accuracy 0.555401
step 42, training accuracy 0.555629
step 43, training accuracy 0.555686
step 44, training accuracy 0.555916
step 45, training accuracy 0.555851
step 46, training accuracy 0.555869
step 47, training accuracy 0.556132
step 48, training accuracy 0.556205
step 49, training accuracy 0.556437
step 50, training accuracy 0.556561
step 51, training accuracy 0.556904
step 52, training accuracy 0.557059
step 53, training accuracy 0.557011
step 54, training accuracy 0.557224
step 55, training accuracy 0.557062
step 56, training accuracy 0.557164
step 57, training accuracy 0.557267
step 58, training accuracy 0.557274
step 59, training accuracy 0.557417
step 60, training accuracy 0.557568
step 61, training accuracy 0.55768
step 62, training accuracy 0.557816
step 63, training accuracy 0.557928
step 64, training accuracy 0.558011
step 65, training accuracy 0.557978
step 66, training accuracy 0.557981
step 67, training accuracy 0.558142
step 68, training accuracy 0.558219
step 69, training accuracy 0.558251
step 70, training accuracy 0.55843
step 71, training accuracy 0.558402
step 72, training accuracy 0.558454
step 73, training accuracy 0.558292
step 74, training accuracy 0.558287
step 75, training accuracy 0.558367
step 76, training accuracy 0.558557
step 77, training accuracy 0.558537
step 78, training accuracy 0.55849
step 79, training accuracy 0.558588
step 80, training accuracy 0.558587
step 81, training accuracy 0.55883
step 82, training accuracy 0.55848
step 83, training accuracy 0.558543
step 84, training accuracy 0.558395
step 85, training accuracy 0.55851
step 86, training accuracy 0.558632
step 87, training accuracy 0.558608
step 88, training accuracy 0.558608
step 89, training accuracy 0.558836
step 90, training accuracy 0.558645
step 91, training accuracy 0.559054
step 92, training accuracy 0.558751
step 93, training accuracy 0.558965
step 94, training accuracy 0.558891
step 95, training accuracy 0.558735
step 96, training accuracy 0.558698
step 97, training accuracy 0.558948
step 98, training accuracy 0.559004
step 99, training accuracy 0.558813
step 100, training accuracy 0.558732
step 101, training accuracy 0.558795
step 102, training accuracy 0.558975
step 103, training accuracy 0.558973
step 104, training accuracy 0.558603
step 105, training accuracy 0.558836
step 106, training accuracy 0.558953
step 107, training accuracy 0.559061
step 108, training accuracy 0.558738
step 109, training accuracy 0.559118
step 110, training accuracy 0.559111
step 111, training accuracy 0.559377
step 112, training accuracy 0.558855
step 113, training accuracy 0.559174
step 114, training accuracy 0.559354
step 115, training accuracy 0.559251
step 116, training accuracy 0.559209
step 117, training accuracy 0.559214
step 118, training accuracy 0.559211
step 119, training accuracy 0.55954
step 120, training accuracy 0.559377
step 121, training accuracy 0.559562
step 122, training accuracy 0.559599
step 123, training accuracy 0.559445
step 124, training accuracy 0.559645
step 125, training accuracy 0.55957
step 126, training accuracy 0.559414
step 127, training accuracy 0.559607
step 128, training accuracy 0.559644
step 129, training accuracy 0.559855
step 130, training accuracy 0.559725
step 131, training accuracy 0.559582
step 132, training accuracy 0.559377
step 133, training accuracy 0.559712
step 134, training accuracy 0.55957
step 135, training accuracy 0.559738
step 136, training accuracy 0.559887
step 137, training accuracy 0.559988
step 138, training accuracy 0.560085
step 139, training accuracy 0.559996
step 140, training accuracy 0.560116
step 141, training accuracy 0.560246
step 142, training accuracy 0.560043
step 143, training accuracy 0.560161
step 144, training accuracy 0.560298
step 145, training accuracy 0.559973
step 146, training accuracy 0.560374
step 147, training accuracy 0.560348
step 148, training accuracy 0.560374
step 149, training accuracy 0.560201
step 150, training accuracy 0.560274
step 151, training accuracy 0.560361
step 152, training accuracy 0.560303
step 153, training accuracy 0.560234
step 154, training accuracy 0.560314
step 155, training accuracy 0.560479
step 156, training accuracy 0.560256
step 157, training accuracy 0.560396
step 158, training accuracy 0.560595
step 159, training accuracy 0.560402
step 160, training accuracy 0.560536
step 161, training accuracy 0.560346
step 162, training accuracy 0.560246
step 163, training accuracy 0.560331
step 164, training accuracy 0.560296
step 165, training accuracy 0.56061
step 166, training accuracy 0.560652
step 167, training accuracy 0.560326
step 168, training accuracy 0.560549
step 169, training accuracy 0.56062
step 170, training accuracy 0.560875
step 171, training accuracy 0.560782
step 172, training accuracy 0.560532
step 173, training accuracy 0.560789
step 174, training accuracy 0.56075
step 175, training accuracy 0.560532
step 176, training accuracy 0.560344
step 177, training accuracy 0.560539
step 178, training accuracy 0.560867
step 179, training accuracy 0.560705
step 180, training accuracy 0.56087
step 181, training accuracy 0.560667
step 182, training accuracy 0.560742
step 183, training accuracy 0.560915
step 184, training accuracy 0.56067
step 185, training accuracy 0.560372
step 186, training accuracy 0.560794
step 187, training accuracy 0.560474
step 188, training accuracy 0.56082
step 189, training accuracy 0.560566
step 190, training accuracy 0.560391
step 191, training accuracy 0.560544
step 192, training accuracy 0.560692
step 193, training accuracy 0.560615
step 194, training accuracy 0.560572
step 195, training accuracy 0.560346
step 196, training accuracy 0.560301
step 197, training accuracy 0.560509
step 198, training accuracy 0.560501
step 199, training accuracy 0.560504
In [17]:
def text2arr(source, start):
    src_as_num = [ord(x) for x in source]
    ret_arr = np.zeros(shape=(1, num_past_characters, possible_chars))
    for inc in range(num_past_characters):
        ret_arr[0][inc][inverse_char_map[src_as_num[start+inc]]] = 1
    return ret_arr

requested_length = 200
text = list("hello world")
for i in range(requested_length):
    predicted_letter = back_to_text(res.eval(feed_dict={x: text2arr(text, len(text)-num_past_characters)}))
    text += predicted_letter
print(''.join(text))
hello world the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the sense of the 

One of the problems with generating text this way is that the model can enter a loop: the text repeats itself instead of producing new text. This happens as soon as the window of characters used to guess the next character repeats itself. To solve this problem, instead of always taking the most likely character we draw the next character according to the probabilities returned by the softmax, so that a character with a high softmax probability is more likely to be chosen as the next character, but the same context no longer always yields the same continuation. A sketch of such sampling appears below.
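
For example, the prediction line in the generation loop above could be replaced by sampling from the softmax output (a sketch, not part of the original notebook):

probs = res.eval(feed_dict={x: text2arr(text, len(text) - num_past_characters)})[0]
probs = probs / probs.sum()                        # normalize, guarding against rounding drift
next_index = np.random.choice(possible_chars, p=probs)
predicted_letter = chr(char_mapping[next_index])   # use this instead of the argmax-based choice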

^ Contents

tf.nn.rnn_cell

tf.nn.rnn_cell.BasicLSTMCell.__init__(num_units, forget_bias=1.0, state_is_tuple=True, activation=tanh)

Documentation of the function in this link

^ Contents

Multiple Layers of RNN

Just as we can take a single neuron and build from it a network with several layers, we can also build a network with several layers of RNN, which means that each layer's $h_i$ becomes the $x_i$ of the next layer.

The code that does this is:

cells_size = [120, 70, 50]

def get_lstm_dropout(cellsize, dropout):
    lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(cellsize, forget_bias=1.0)
    dropoutRNN = tf.nn.rnn_cell.DropoutWrapper(lstm_cell, output_keep_prob=dropout)
    return dropoutRNN

multi_cell = tf.nn.rnn_cell.MultiRNNCell([get_lstm_dropout(mem_size, dropout) for mem_size in cells_size])
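
The resulting multi-layer cell can then be used in place of the single cell from the text-generation example above, e.g. (assuming the placeholder x and a keep probability dropout are defined as before):

output, _ = tf.nn.dynamic_rnn(multi_cell, x, dtype=tf.float32)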

^ Contents

BLEU (Bilingual Evaluation Understudy) Score

BLEU is a measure of the quality of a translation produced by machine translation (MT, the candidate) relative to the translation of a professional human (the reference sentence). It considers the translation as a whole and does not take intelligibility or grammatical errors into account. The metric is based on comparing the candidate to the correct translation of the sentence, the reference sentence. The output is a number between 0 and 1, where a value close to 1 means the candidate is close to the reference sentence.

The algorithm:

  1. Divide the candidate into n-gram units (usually 4-grams are used).
  2. Count how many of the candidate's n-gram units appear in the reference sentence and divide by the number of n-gram units in the candidate. (When a single unit is checked, the order of the words within it is not important, only whether or not the words appear.)

Because of the way the algorithm works, it gives higher scores to shorter sentences, whereas in many cases the longer sentences are the correct ones. Therefore, when the candidate is shorter than the reference, we multiply the result given by the algorithm by $e^{1-r/c}$, where r is the total length of all the reference sentences and c is the total length of all the candidates. A minimal sketch of this score is shown below.
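
Here is a minimal sketch of this score (not from the course notes; it combines the n-gram precisions for n = 1..4 with a geometric mean, as in the standard BLEU formulation, and applies the brevity penalty when the candidate is shorter than the reference):

from collections import Counter
import math

def ngrams(words, n):
    return [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # each candidate n-gram is credited at most as many times as it appears in the reference
        overlap = sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
        precisions.append(max(overlap, 1e-9) / max(len(ngrams(cand, n)), 1))
    score = math.exp(sum(math.log(p) for p in precisions) / max_n)   # geometric mean of the precisions
    c, r = len(cand), len(ref)
    return score * (math.exp(1 - r / c) if c < r else 1.0)           # brevity penalty for short candidates

print(bleu("the child ate pizza", "the child ate pizza and did not eat vegetables"))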