The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author's lone goal is to show that the entire field might have evolved a different direction if we had instead been obsessed with a slightly different acronym and slightly different result. We take a previously strong language model based only on boring LSTMs and get it to within a stone's throw of a stone's throw of state-of-the-art byte level language model results on enwik8. This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author's small studio apartment far too warm in the midst of a San Franciscan summer. The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts with minimal computation. Take that Sesame Street.
Many researchers and practitioners wrote off RNNs and their variants after the advent of Transformers, but not the author. This piece of research is an eye-opener for anyone who thinks compute is the only way forward: near state-of-the-art results are achievable in under 24 hours on a single GPU, "as the author is impatient".

"Irrational as it seems I didn't want to use a cluster in the cloud somewhere, watching the dollars leave my bank account." -- Author

This paper is not solely about the architecture or achieving SoTA; it questions the practices and direction of the community. To hear this in the author's words: "I'm entirely happy if this model fails, but why dismiss possibility out of hand?" Let's start diving into the paper.
It may never have occurred to you how you manage to make sense of what your friend is blabbering at a loud party. A party is full of competing noises; how, then, are we perfectly able to carry on a conversation? This question is widely known as the 'cocktail party problem'. Most of our cognitive processes can pay attention to only a single activity at a time, yet our ability to direct attention toward one set of words while ignoring other, often overpowering, sets of words is still a conundrum.
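The "attend to one signal, ignore the rest" idea above is what attention mechanisms formalize. As a minimal sketch (plain scaled dot-product attention with one head, illustrative only and not the SHA-RNN implementation from the paper; the function names here are my own), each query scores every key, the scores are softmax-normalized into weights, and the output is the weighted sum of values:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(queries, keys, values):
    """Scaled dot-product attention with a single head.

    queries: (T_q, d); keys and values: (T_k, d).
    Returns the attended values (T_q, d) and the weights (T_q, T_k).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # similarity of each query to each key
    weights = softmax(scores, axis=-1)      # each query's weights sum to 1
    return weights @ values, weights

# Toy usage: 4 query positions attending over 6 key/value positions.
rng = np.random.default_rng(0)
out, attn = single_head_attention(rng.normal(size=(4, 8)),
                                  rng.normal(size=(6, 8)),
                                  rng.normal(size=(6, 8)))
```

The softmax is the "ignore the rest" step: it concentrates each query's weight on the few keys it scores highly, much like attending to one voice at the party.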
Two of the most important aspects of machine learning models are feature extraction and feature engineering. Features are what supply relevant information to the model. If the features are few or irrelevant, your model may have a hard time making any useful predictions; if there are too many features, your model will be slow and likely to overfit. And humans don't necessarily know which feature representations are best for a given task.
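The too-many-features failure mode can be seen in a tiny least-squares experiment (an illustrative sketch of my own, not an example from the paper): when a linear model is given 50 irrelevant features on top of the 2 that matter, and has more parameters than training samples, it fits the training data perfectly yet generalizes worse than the model restricted to the relevant features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_noise = 20, 200, 50
w_true = np.array([3.0, -2.0])  # only two features actually matter

def make_data(n):
    X = rng.normal(size=(n, 2))            # relevant features
    junk = rng.normal(size=(n, n_noise))   # irrelevant features
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, junk, y

X_tr, J_tr, y_tr = make_data(n_train)
X_te, J_te, y_te = make_data(n_test)

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

# Model A: only the two relevant features.
w_a, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
test_mse_simple = mse(X_te @ w_a, y_te)

# Model B: relevant plus irrelevant features (52 params, 20 samples).
B_tr = np.hstack([X_tr, J_tr])
B_te = np.hstack([X_te, J_te])
w_b, *_ = np.linalg.lstsq(B_tr, y_tr, rcond=None)
train_mse_overfit = mse(B_tr @ w_b, y_tr)  # interpolates the training set
test_mse_overfit = mse(B_te @ w_b, y_te)   # much worse on fresh data
```

With more parameters than samples, `lstsq` returns a solution that drives training error to essentially zero by assigning weight to the junk columns, and that memorized noise is exactly what hurts on the test set.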
The artificial intelligence field sees over 14,000 papers published each year and attracts some of the most productive research groups in the world. AI conferences like NeurIPS, ICML, ICLR, ACL and MLDS, among others, attract scores of interesting papers every year. The year 2019 saw an increase in the number of submissions, along with noticeable trends such as a 194% increase in the use of PyTorch as a research framework, among many others.