Collaborating Authors

Intuitive Introduction to BERT – MachineCurve


Transformers are taking the world of NLP by storm. After being introduced in Vaswani et al.'s Attention is all you need work back in 2017, they – and particularly their self-attention mechanism requiring no recurrent elements to be used anymore – have proven to show state-of-the-art performance on a wide variety of language tasks. Nevertheless, what's good can still be improved, and this process has been applied to Transformers as well. After the introduction of the'vanilla' Transformer by Vaswani and colleagues, a group of people at OpenAI have used just the decoder segment and built a model that works great. However, according to Devlin et al., the authors of a 2018 paper about pretrained Transformers in NLP, they do one thing wrong: the attention that they apply is unidirectional. This hampers learning unnecessarily, they argue, and they proposed a bidirectional variant instead: BERT, or Bidirectional Encoder Representations from Transformers.

Understanding Transformers in NLP: State-of-the-Art Models


I love being a data scientist working in Natural Language Processing (NLP) right now. The breakthroughs and developments are occurring at an unprecedented pace. From the super-efficient ULMFiT framework to Google's BERT, NLP is truly in the midst of a golden era. And at the heart of this revolution is the concept of the Transformer. This has transformed the way we data scientists work with text data – and you'll soon see how in this article.

The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)


The year 2018 has been an inflection point for machine learning models handling text (or more accurately, Natural Language Processing or NLP for short). Our conceptual understanding of how best to represent words and sentences in a way that best captures underlying meanings and relationships is rapidly evolving. Moreover, the NLP community has been putting forward incredibly powerful components that you can freely download and use in your own models and pipelines (It's been referred to as NLP's ImageNet moment, referencing how years ago similar developments accelerated the development of machine learning in Computer Vision tasks). But I couldn't think of anything else..) One of the latest milestones in this development is the release of BERT, an event described as marking the beginning of a new era in NLP. BERT is a model that broke several records for how well models can handle language-based tasks.

8 Leading Language Models For NLP In 2020


The introduction of transfer learning and pretrained language models in natural language processing (NLP) pushed forward the limits of language understanding and generation. Transfer learning and applying transformers to different downstream NLP tasks have become the main trend of the latest research advances. At the same time, there is a controversy in the NLP community regarding the research value of the huge pretrained language models occupying the leaderboards. While lots of AI experts agree with Anna Rogers's statement that getting state-of-the-art results just by using more data and computing power is not research news, other NLP opinion leaders point out some positive moments in the current trend, like, for example, the possibility of seeing the fundamental limitations of the current paradigm. Anyway, the latest improvements in NLP language models seem to be driven not only by the massive boosts in computing capacity but also by the discovery of ingenious ways to lighten models while maintaining high performance.

Understanding BERT


BERT (Bidirectional Encoder Representations from Transformers) is a research paper published by Google AI language. Unlike previous versions of NLP architectures, BERT is conceptually simple and empirically powerful. BERT has a benefit over another standard LM because it applies deep bidirectional context training of the sequence meaning it considers both left and right context while training whereas other LM model such as OpenAI GPT is unidirectional, every token can only attend to previous tokens in attention layers. Such restrictions are suboptimal for sentence-level tasks (paraphrasing) or token level tasks (named entity recognition, question-answering), where it is crucial to incorporate context from both directions. In earlier versions of LM, such as Glove, we have fixed embeddings of the words. For example, for the word "right," the embedding is the same irrespective of its context in the sentence.