Calibration, Entropy Rates, and Memory in Language Models
Mark Braverman, Xinyi Chen, Sham M. Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang
Recent advances in language modeling have resulted in significant breakthroughs on a wide variety of benchmarks in natural language processing Dai et al. [2018], Gong et al. [2018], Takase et al. [2018]. Capturing long-term dependencies has been a particular focus, with approaches ranging from explicit memory-based neural networks Grave et al. [2016], Ke et al. [2018] to optimization improvements aimed at stabilizing training Le et al. [2015], Trinh et al. [2018]. In this paper, we address a basic question: how do the long-term dependencies in a language model's generations compare to those of the underlying language? Furthermore, if there are measurable discrepancies, can we use them to improve these models, and how?

Starting from Shannon's seminal work that essentially introduced statistical language modeling Shannon [1951], the most classical and widely studied long-term property of a language model is its entropy rate -- the average amount of information contained per word, conditioned on the preceding words. A learned model provides an upper bound on the entropy rate of a language via its cross-entropy loss. The exponential of the entropy rate can be interpreted as the effective support size of the distribution of the next word (intuitively, the average number of "plausible" word choices to continue a document), and the perplexity of a model (the exponential of its cross-entropy loss) is an upper bound on this quantity.
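To make the quantities above concrete, here is a minimal sketch (not from the paper) of how the empirical cross-entropy of a model upper-bounds the entropy rate, and how its exponential gives the perplexity, i.e., the effective number of plausible next-word choices. The per-token log-probabilities are hypothetical placeholder values.

```python
import math

def entropy_rate_upper_bound(token_log_probs):
    """Average negative log-probability per token (in nats): the model's
    cross-entropy on the corpus, an upper bound on the language's entropy rate."""
    return -sum(token_log_probs) / len(token_log_probs)

def perplexity(token_log_probs):
    """exp(cross-entropy): roughly the effective support size of the
    next-word distribution, an upper bound on exp(entropy rate)."""
    return math.exp(entropy_rate_upper_bound(token_log_probs))

# Hypothetical log-probabilities a model assigns to each token of a held-out document.
log_probs = [-2.1, -0.7, -3.4, -1.2, -0.9]
print(entropy_rate_upper_bound(log_probs))  # nats per token
print(perplexity(log_probs))                # effective number of plausible next words
```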
Jun-11-2019