Interpolating between types and tokens by estimating power-law generators
Goldwater, Sharon; Johnson, Mark; Griffiths, Thomas L.
Neural Information Processing Systems
Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process, the Pitman-Yor process, as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
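The generator-plus-adaptor idea can be illustrated with a minimal sketch. It assumes the standard Chinese-restaurant-style seating scheme for the Pitman-Yor process: a new token joins an existing table with weight proportional to that table's count minus a discount, or opens a new table whose word is drawn from the generator. The function names, the parameter names `discount` and `concentration`, and the toy uniform generator are illustrative assumptions, not taken from the paper.

```python
import random
from collections import Counter


def toy_generator(rng):
    """Hypothetical generator: a uniform draw from 1000 made-up word types."""
    return f"w{rng.randrange(1000)}"


def pitman_yor_adaptor(generator, n_tokens, discount, concentration, rng):
    """Sample n_tokens words from a generator wrapped in a Pitman-Yor adaptor.

    Returns (tokens, table_counts, table_labels).
    """
    if not 0.0 <= discount < 1.0 or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")
    table_counts = []  # number of tokens seated at each table
    table_labels = []  # word the generator assigned to each table
    tokens = []
    for _ in range(n_tokens):
        n_tables = len(table_counts)
        if n_tables == 0:
            k = 0  # the first token always opens a new table
        else:
            # Existing table k has weight (count_k - discount); a new table
            # has weight (n_tables * discount + concentration).
            weights = [c - discount for c in table_counts]
            weights.append(n_tables * discount + concentration)
            k = rng.choices(range(n_tables + 1), weights=weights)[0]
        if k == n_tables:
            # New table: ask the generator for its word.
            table_counts.append(1)
            table_labels.append(generator(rng))
        else:
            # Existing table: reuse its word, so frequent words get more frequent.
            table_counts[k] += 1
        tokens.append(table_labels[k])
    return tokens, table_counts, table_labels


tokens, counts, labels = pitman_yor_adaptor(
    toy_generator, n_tokens=2000, discount=0.5, concentration=1.0,
    rng=random.Random(0),
)
# Print the five most frequent words with their token counts.
print(Counter(tokens).most_common(5))
```

In this sketch the generator is consulted only when a new table opens, so it determines which word types exist, while the seating rule alone determines how often each one recurs as a token. Raising `discount` toward 1 opens tables more readily, so the generator is consulted for a larger share of the tokens.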
Dec-31-2006