The evolution of Tokenization in NLP -- Byte Pair Encoding in NLP
NLP may have been a little late to the AI epiphany but it is doing wonders with organizations like Google, OpenAI releasing state-of-the-art(SOTA) language models like BERT and GPT-2/3 respectively. GitHub Copilot and OpenAI codex are among a few very popular applications that are in the news. As someone who has very limited exposure to NLP, I decided to take up NLP as an area of research and the next few blogs/videos will be me sharing what I learn after dissecting some important components of NLP. Top Deep Learning models like BERT, GPT-2, or GPT-3 all share the same components but with different architectures that distinguish one model from another. In this newsletter(and notebook), we are going to focus on the basics of the first component of an NLP pipeline which is tokenization.
Oct-5-2021, 13:10:04 GMT
- Technology: