In a recent post on BERT, we discussed BERT transformers and how they work on a basic level. The article covers BERT architecture, training data, and training tasks. However, we don't really understand something before we implement it ourselves. So in this post, we will implement a Question Answering Neural Network using BERT and a Hugging Face Library. In this task, we are given a question and a paragraph in which the answer lies to our BERT Architecture and the objective is to determine the start and end span for the answer in the paragraph.
BERT is a state-of-the-art model by Google that came in 2019. In this blog, I will go step by step to finetune the BERT model for movie reviews classification(i.e positive or negative). Here, I will be using the Pytorch framework for the coding perspective. BERT is built on top of the transformer (explained in paper Attention is all you Need). Input text sentences would first be tokenized into words, then the special tokens ( [CLS], [SEP], ##token) will be added to the sequence of words.
This project attempts to build a Question- Answering system in the News Domain, where Passages will be News articles, and anyone can ask a Question against it. We have built a span-based model using an Attention mechanism, where the model predicts the answer to a question as to the position of the start and end tokens in a paragraph. For training our model, we have used the Stanford Question and Answer (SQuAD 2.0) dataset. To do well on SQuAD 2.0, systems must not only answer questions when possible but also determine when no answer is supported by the paragraph and abstain from answering. Our model architecture comprises three layers- Embedding Layer, RNN Layer, and the Attention Layer. For the Embedding layer, we used GloVe and the Universal Sentence Encoder. For the RNN Layer, we built variations of the RNN Layer including bi-LSTM and Stacked LSTM and we built an Attention Layer using a Context to Question Attention and also improvised on the innovative Bidirectional Attention Layer. Our best performing model which uses GloVe Embedding combined with Bi-LSTM and Context to Question Attention achieved an F1 Score and EM of 33.095 and 33.094 respectively. We also leveraged transfer learning and built a Transformer based model using BERT. The BERT-based model achieved an F1 Score and EM of 57.513 and 49.769 respectively. We concluded that the BERT model is superior in all aspects of answering various types of questions.
This is arguably the most important architecture for natural language processing (NLP) today. Specifically, we look at modeling frameworks such as the generative pretrained transformer (GPT), bidirectional encoder representations from transformers (BERT) and multilingual BERT (mBERT). These methods employ neural networks with more parameters than most deep convolutional and recurrent neural network models. Despite the larger size, they've exploded in popularity because they scale comparatively more effectively on parallel computing architecture. This enables even larger and more sophisticated models to be developed in practice. Until the arrival of the transformer, the dominant NLP models relied on recurrent and convolutional components. Additionally, the best sequence modeling and transduction problems, such as machine translation, rely on an encoder-decoder architecture with an attention mechanism to detect which parts of the input influence each part of the output. The transformer aims to replace the recurrent and convolutional components entirely with attention.
We use Recall@position as the main evaluation metric for the retriever. The obvious goal of the retriever is to have the highest recall possible at the lowest possible position. Since the final top position passages are re-ranked using the BERT-based reader, the fewer passages we need to evaluate the better the run time complexity and performance.