Hope you already know the basics of Recurrent Neural Networks (RNNs). If not, feel free to refer to this article. In Natural Language Processing (NLP), RNNs have played a major role in sequence modeling. Let’s see what RNNs can do and where their weak points are. Without jumping straight into the topic, let’s move slowly!

In an RNN, we feed the output (the hidden state) of the previous timestep as an input to the next timestep. This is why RNNs work well with sequential information like sentences.
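To make the recurrence concrete, here is a minimal sketch of a single RNN step in plain NumPy, assuming a simple tanh cell; the names (`rnn_step`, `W_xh`, `W_hh`) and the toy sizes are illustrative, not taken from any particular library.

```python
# A minimal sketch of one RNN timestep, assuming a simple tanh cell.
# hidden_size, input_size and the weight names are illustrative toys.
import numpy as np

hidden_size, input_size = 4, 3
W_xh = np.random.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_h  = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """New hidden state from the current input and the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)        # initial hidden state
x = np.random.randn(input_size)  # a toy input for one timestep
h = rnn_step(x, h)               # this output is fed into the next timestep
```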

For clarity, look at the example below:

If we consider the example of predicting the next word in a sentence based on the previous words, the image below depicts how an RNN behaves at each timestep. The hidden state is depicted as the yellow rectangle.

You can see that the final hidden state contains a summary of the whole sentence.
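As a rough sketch of that behaviour, the loop below unrolls the same update over a toy ‘Anne bought a’ sequence; the random embeddings and sizes are purely illustrative, and the final `h` is the sentence summary mentioned above.

```python
# A minimal sketch of unrolling the RNN over a sentence, so the final hidden
# state summarizes 'Anne bought a' (the embeddings here are random toys).
import numpy as np

hidden_size, embed_size = 4, 3
W_xh = np.random.randn(hidden_size, embed_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
b_h  = np.zeros(hidden_size)

sentence = ["Anne", "bought", "a"]
embeddings = {w: np.random.randn(embed_size) for w in sentence}  # toy word vectors

h = np.zeros(hidden_size)
for word in sentence:                       # one timestep per word
    h = np.tanh(W_xh @ embeddings[word] + W_hh @ h + b_h)

print(h)  # final hidden state: a fixed-size summary of the whole sentence
```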

This is how the neurons are connected and how the RNN appears.

But each timestep has its own individual loss. (You can sum them up to get a single loss value.)
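As a small illustration (with a toy vocabulary, made-up hidden states, and made-up targets), the per-timestep cross-entropy losses can be summed into a single value like this:

```python
# A minimal sketch of per-timestep losses summed into one value, assuming a
# softmax output layer and cross-entropy at every timestep (all data are toys).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

vocab_size, hidden_size = 5, 4
W_hy = np.random.randn(vocab_size, hidden_size) * 0.1   # hidden-to-output weights

hidden_states = [np.random.randn(hidden_size) for _ in range(3)]  # h_1..h_3 from the forward pass
targets = [2, 0, 4]   # index of the true next word at each timestep

total_loss = 0.0
for h_t, y_t in zip(hidden_states, targets):
    probs = softmax(W_hy @ h_t)         # predicted next-word distribution
    total_loss += -np.log(probs[y_t])   # cross-entropy loss at this timestep

print(total_loss)  # single loss value obtained by summing the per-timestep losses
```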

When we consider backpropagation, an RNN has an additional dimension, time, on top of the weight matrices. Timestep 5 (the last one) propagates its gradient in the usual way. At timestep 4, we also have to consider the gradient coming back from timestep 5. At timestep 3, we have to consider the gradients of all the timesteps after it, flowing back from the last timestep.

In this way, in the backpropagation of an RNN, at each timestep r we accumulate the gradients flowing back from the last timestep and pass them on to timestep r-1.
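To see where those gradients come from, here is a minimal sketch of backpropagation through time for a tanh RNN with a softmax output; the variable `dh_next` carries the gradient from later timesteps back to the current one. All sizes, inputs, and targets are toys, not the article’s exact setup.

```python
# A minimal sketch of backpropagation through time (BPTT) for a tanh RNN,
# showing how the gradient at timestep t also carries the gradients of all
# later timesteps (dh_next). Sizes and data are illustrative toys.
import numpy as np

np.random.seed(0)
I, H, V, T = 3, 4, 5, 3                      # input size, hidden size, vocab size, timesteps
W_xh = np.random.randn(H, I) * 0.1
W_hh = np.random.randn(H, H) * 0.1
W_hy = np.random.randn(V, H) * 0.1

xs = [np.random.randn(I) for _ in range(T)]  # toy inputs x_1..x_T
targets = [1, 3, 0]                          # toy next-word indices

# ---- forward pass: store hidden states for the backward pass ----
hs = {-1: np.zeros(H)}
probs = {}
for t in range(T):
    hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1])
    z = W_hy @ hs[t]
    e = np.exp(z - z.max())
    probs[t] = e / e.sum()

# ---- backward pass: walk from the last timestep back to the first ----
dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
dh_next = np.zeros(H)                        # gradient flowing back from later timesteps
for t in reversed(range(T)):
    dy = probs[t].copy()
    dy[targets[t]] -= 1                      # gradient of softmax + cross-entropy
    dW_hy += np.outer(dy, hs[t])
    dh = W_hy.T @ dy + dh_next               # local gradient + gradient from the future
    dh_raw = (1 - hs[t] ** 2) * dh           # backprop through tanh
    dW_xh += np.outer(dh_raw, xs[t])
    dW_hh += np.outer(dh_raw, hs[t - 1])
    dh_next = W_hh.T @ dh_raw                # passed on to timestep t-1
```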

What can you see from the above?

✔ Rather than using only one or two previous words to predict the next word in a sentence, this approach preserves dependencies across the whole sequence!

By considering just the previous word, for example ‘a’ in the above context, the neural network has many possibilities to predict: a river, a student, a car, etc. (The possibilities depend on the words in its corpus.)

But in this RNN-based approach, we have the whole sequence ‘Anne bought a’ to predict the next word. So now there is a very low probability of the next word being ‘river’ or ‘student’.

This is a significant advantage gained by using an RNN.

So, what’s wrong? Recall the backpropagation I just explained with the example: the gradients have to flow back through every timestep, and over long sequences they can repeatedly shrink or blow up, making it hard for the network to learn long-range dependencies (the vanishing and exploding gradient problem).

To overcome this, LSTM (Long Short-Term Memory) networks were introduced, which are able to address these pitfalls of RNNs!


*Lead Image by chenspec from Pixabay*