From Bag-of-words to Attention Models

NLP is the branch of AI that aims to make machines understand textual information. Naturally, the first step of this journey (Figure 2) was to convert textual information into numbers, vectors, or matrices so that machines could make sense of the underlying data. As shown in the diagram, each step introduced some aspect that brought an information or performance gain over the previous step.

Figure 2: The Building Blocks Of Today’s NLP

· Bag-of-words: "Converted textual information to numerical form"

That is exactly what the bag-of-words method did. It converted the words/tokens in a sentence into numbers or matrices using techniques like count vectorization, tf-idf, or n-grams. Coupled with cosine similarity, this allowed the NLP community to make significant progress in extracting insights from textual information. However, the major drawback of this technique was that it gave no importance to the underlying meaning of the words or to their context.
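
As a rough illustration, the sketch below builds tf-idf vectors (with unigrams and bigrams) and compares documents using cosine similarity. It assumes scikit-learn is available; the documents and settings are made up for the example, and note that word order and meaning play no role in the comparison.

```python
# Minimal bag-of-words sketch using scikit-learn (assumed available):
# documents become sparse tf-idf vectors, and cosine similarity compares them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vectorizer.fit_transform(docs)                # documents x vocabulary matrix

print(cosine_similarity(X))  # pairwise similarity between the documents
```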

· Word Embeddings: "Extracted meaning from represented words"

Word embedding techniques attempted to extract the meaning behind the words used in a sentence. The method aims to represent every word in the vocabulary in an n-dimensional space such that similar words appear close to one another. Word2vec and GloVe are common models used to create these vector representations. This is similar to how computer vision techniques extract n features of a face and compare them to recognize different users. This was a significant step for the NLP community, and many believed it was the inflection point for NLP, as it allowed learning to be transferred from one use case to another, which was not previously possible. However, one major drawback of this method was that it ignored the positional information of the words.
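
The following toy sketch shows the idea of mapping words into an n-dimensional space using gensim's Word2Vec (assumed installed, recent 4.x API). The corpus and dimensions are tiny and purely illustrative, so the neighbours it finds will not be meaningful.

```python
# Toy word-embedding sketch using gensim's Word2Vec (assumed installed).
# Each word is mapped to an n-dimensional vector; similar words end up
# close together in that space.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["stocks", "fell", "on", "the", "market"],
]

# Tiny corpus and dimensions purely for illustration.
model = Word2Vec(sentences=corpus, vector_size=16, window=2, min_count=1, seed=1)

vec = model.wv["cat"]                        # 16-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=2))  # nearest neighbours in that space
```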

· RNN: “Extracted positional information”

RNN-based models gave significance to positional information as well. Moreover, they could handle use cases where input and output are of different lengths, which is required for language translation. They leveraged a recurrent neural network for this: as with any RNN, each word in a sentence is predicted based not only on the current input but also on prior inputs. However, one drawback was that this technique did not do well on long sentences, because prior information gets diluted as the sentence gets longer.
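
A bare-bones recurrent step, written in plain NumPy as an illustration (the weights are random, not trained), shows how the hidden state carries prior inputs forward and why the influence of early words fades over long sequences.

```python
# Illustrative recurrent step in NumPy: the hidden state h carries information
# from all previous tokens, so each step depends on the current input *and*
# everything seen before it.
import numpy as np

hidden, embed = 8, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(hidden, embed))   # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(hidden, hidden))  # hidden-to-hidden weights

def rnn_forward(inputs):
    h = np.zeros(hidden)
    for x in inputs:                      # strictly sequential: no parallelism
        h = np.tanh(W_xh @ x + W_hh @ h)  # new state mixes input with old state
    return h                              # summary of the whole sequence

sentence = [rng.normal(size=embed) for _ in range(20)]
print(rnn_forward(sentence))  # early tokens' influence fades on long inputs
```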

· LSTM Based Models: "Solved the problem of vanishing gradients"

LSTM models helped overcome the problem of vanishing gradients, as they had a mechanism to forget irrelevant information and carry only relevant information to deeper layers. They used gates for this purpose, which enabled each layer to decide what information needed to be kept and what information could be discarded. While LSTM solved the vanishing gradient problem, it still had a drawback: the technique used only prior information for prediction. However, in many NLP tasks (like translation), in order to make a good judgement it is important to have information about words used later as well.
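
The sketch below spells out a single LSTM step in NumPy to show the gating idea described above; the shapes and random weights are purely illustrative, not a trained model.

```python
# Sketch of one LSTM cell step in NumPy: the forget gate f decides what to
# discard from the cell state, the input gate i decides what new information
# to keep, and the output gate o decides what to expose as the hidden state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b each hold parameters for the f, i, o and candidate (g) paths.
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate values
    c = f * c_prev + i * g        # keep relevant old info, add relevant new info
    h = o * np.tanh(c)            # expose a filtered view of the cell state
    return h, c

hidden, embed = 8, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(hidden, embed)) for k in "fiog"}
U = {k: rng.normal(scale=0.1, size=(hidden, hidden)) for k in "fiog"}
b = {k: np.zeros(hidden) for k in "fiog"}

h, c = np.zeros(hidden), np.zeros(hidden)
for x in [rng.normal(size=embed) for _ in range(10)]:
    h, c = lstm_step(x, h, c, W, U, b)
print(h)
```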

· Bi-Directional LSTM: “Considered the complete sentence in order to predict”

Bi-directional RNN or bi-directional LSTM models work very similarly to LSTM-based models. The only difference is that they take the complete text (both past and future words) into account in order to predict the present word. These models have both a forward recurrent component and a backward recurrent component. One major disadvantage of this technique was that it required the complete sequence of data to make a prediction. However, humans do not work like this. For example, if a text is to be translated into another language, a human need not hear the complete text before starting the translation. Rather, after hearing a substantial part of the text, a human can give different attention to the words already heard and can translate with a certain degree of confidence.
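
A minimal sketch of a bi-directional LSTM using PyTorch (assumed available) shows the forward and backward hidden states being concatenated per position, and why the whole sequence must be available before the backward pass can run.

```python
# Bi-directional LSTM sketch using PyTorch (assumed available): one pass runs
# forward over the sequence, one runs backward, and their hidden states are
# concatenated, so every position "sees" both past and future tokens.
import torch
import torch.nn as nn

embed, hidden = 4, 8
bilstm = nn.LSTM(input_size=embed, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 10, embed)        # one sentence of 10 token embeddings
out, (h_n, c_n) = bilstm(x)

print(out.shape)   # (1, 10, 2 * hidden): forward + backward states per token
# Note: the backward pass needs the *whole* sequence before it can start,
# which is the limitation discussed above.
```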

· Attention based models: “Ability to focus on relevant input via attention weights”

These models differ from a bi-directional RNN/LSTM model in that they look at an input sequence and decide at each step which other parts of the sequence are important or need attention. So apart from the forward recurrent component, the backward recurrent component, the hidden state, and the previous output, the model also considers the weightage of the words surrounding the current context. These weights, called attention weights, help give relevant weight to different parts of the input. While this method improved accuracy and made the models behave in a more human-like way, it faced a challenge in terms of performance: since computation was done sequentially, it was difficult to scale the solution up in practical applications. The sequential nature of the model architecture prevented parallelization.
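
The following NumPy sketch shows the attention-weight idea in isolation: a decoder state scores every encoder state, the scores are normalised with a softmax into attention weights, and the weighted sum gives a context vector. All values are random placeholders for illustration.

```python
# Minimal dot-product attention sketch in NumPy: score each encoder state
# against the current decoder state, normalise the scores, and take the
# weighted sum as the context vector passed to the next prediction.
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(10, 8))   # 10 input positions, dimension 8
decoder_state = rng.normal(size=8)          # current decoding step

scores = encoder_states @ decoder_state             # relevance of each input
weights = np.exp(scores) / np.exp(scores).sum()     # attention weights (softmax)
context = weights @ encoder_states                  # focus on the relevant inputs

print(weights.round(3))  # which input positions the model attends to
print(context)           # information used for the next prediction
```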

· Transformers: “Parallelized the processing of sequential data for better performance”

Transformers differed from a sequence-to-sequence model in that they did not use an RNN but instead focused on leveraging the attention mechanism. Attention allowed transformers to process all elements simultaneously by forming direct connections between individual elements. Not only did this enable parallelization, it also resulted in a higher degree of accuracy across a range of tasks. To remember the order of the input, the position of each word was embedded into its representation. However, even transformers had limitations. Attention often dealt with fixed-length text segments, which caused an issue of context fragmentation. This limitation was overcome by a modified architecture called Transformer-XL, where hidden states obtained from previous segments are reused as input to the current segment. There are different implementations of transformers, like BERT and ALBERT, which combine current trends such as attention-based transformers and transfer learning to get superior results.
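
To make the two key ingredients concrete, the sketch below (plain NumPy, illustrative only, with the same matrix standing in for the learned query/key/value projections) adds sinusoidal positional encodings to token embeddings and applies scaled dot-product self-attention to all positions at once.

```python
# Sketch of the core transformer ingredients in NumPy: sinusoidal positional
# encodings inject word order, and scaled dot-product self-attention lets every
# position attend to every other position in parallel (no recurrence).
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

def self_attention(X):
    # For brevity the same matrix plays the query, key and value roles;
    # real transformers learn separate Q, K, V projections.
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)                               # all pairs at once
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X                                            # contextualised tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))                # 6 token embeddings, dimension 16
tokens = tokens + positional_encoding(6, 16)     # add word-order information
print(self_attention(tokens).shape)              # (6, 16), computed in parallel
```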