Athiwaratkun, Ben
Improving Consistency-Based Semi-Supervised Learning with Weight Averaging
Athiwaratkun, Ben, Finzi, Marc, Izmailov, Pavel, Wilson, Andrew Gordon
Recent advances in deep unsupervised learning have renewed interest in semi-supervised methods, which can learn from both labeled and unlabeled data. Presently, the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. We show that consistency regularization leads to flatter but narrower optima. We also show that the test error surface for these methods is approximately convex in regions of weight space traversed by SGD. Inspired by these observations, we propose to train consistency-based semi-supervised models with stochastic weight averaging (SWA), a recent method which averages weights along the trajectory of SGD. We also develop fast-SWA, which further accelerates convergence by averaging multiple points within each cycle of a cyclical learning rate schedule. With fast-SWA we achieve the best known semi-supervised results on CIFAR-10 and CIFAR-100 over many different numbers of observed training labels. For example, we achieve 95.0% accuracy on CIFAR-10 with only 4000 labels, compared to the previous best result in the literature of 93.7%. We also improve the best known accuracy for domain adaptation from CIFAR-10 to STL from 80% to 83%. Finally, we show that with fast-SWA the simple $\Pi$ model becomes state-of-the-art for large labeled settings.
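The averaging scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a stored trajectory of flattened weight vectors, a linearly decaying learning rate within each cycle, and a hypothetical choice of averaging the last `k` checkpoints per cycle (plain SWA would keep only one point per cycle).

```python
import numpy as np

def cyclical_lr(step, cycle_len, lr_max, lr_min):
    # Illustrative cyclical schedule: decay linearly from lr_max to
    # lr_min within each cycle, then restart.
    t = (step % cycle_len) / max(cycle_len - 1, 1)
    return lr_max * (1.0 - t) + lr_min * t

def fast_swa(weights_trajectory, cycle_len, k):
    # fast-SWA idea: average multiple checkpoints within every cycle
    # (here, the last k steps of each cycle), instead of a single
    # end-of-cycle checkpoint as in plain SWA.
    picks = [w for i, w in enumerate(weights_trajectory)
             if (i % cycle_len) >= cycle_len - k]
    return np.mean(picks, axis=0)
```

With six checkpoints and `cycle_len=3`, `k=2`, the average is taken over steps 1, 2, 4, and 5 of the trajectory; the averaged weights are then used for prediction.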
Probabilistic FastText for Multi-Sense Word Embeddings
Athiwaratkun, Ben, Wilson, Andrew Gordon, Anandkumar, Anima
We introduce Probabilistic FastText, a new model for word embeddings that can capture multiple word senses, sub-word structure, and uncertainty information. In particular, we represent each word with a Gaussian mixture density, where the mean of a mixture component is given by the sum of its n-gram vectors. This representation allows the model to share statistical strength across sub-word structures (e.g. Latin roots), producing accurate representations of rare, misspelled, or even unseen words. Moreover, each component of the mixture can capture a different word sense. Probabilistic FastText outperforms both FastText, which has no probabilistic model, and dictionary-level probabilistic embeddings, which do not incorporate subword structures, on several word-similarity benchmarks, including English RareWord and foreign language datasets. We also achieve state-of-the-art performance on benchmarks that measure ability to discern different meanings. Thus, the proposed model is the first to achieve multi-sense representations while having enriched semantics on rare words.
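The component-mean construction above can be sketched in a few lines. This is a simplified illustration under assumed details: boundary-marked character trigrams as in FastText, and a hypothetical dictionary lookup for n-gram vectors (the real system typically hashes n-grams into a fixed table).

```python
import numpy as np

def char_ngrams(word, n=3):
    # Character n-grams with boundary markers, FastText-style.
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def component_mean(word, ngram_table, dim=4):
    # Mean of one Gaussian mixture component = sum of the word's
    # n-gram vectors. Unseen n-grams default to zeros here; that is
    # an illustrative choice, not the paper's.
    total = np.zeros(dim)
    for g in char_ngrams(word):
        total += ngram_table.get(g, np.zeros(dim))
    return total
```

Because the mean is built from shared sub-word pieces, a rare or misspelled word still lands near words with overlapping n-grams.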
Hierarchical Density Order Embeddings
Athiwaratkun, Ben, Wilson, Andrew Gordon
By representing words with probability densities rather than point vectors, probabilistic word embeddings can capture rich and interpretable semantic information and uncertainty. The uncertainty information can be particularly meaningful in capturing entailment relationships -- whereby general words such as "entity" correspond to broad distributions that encompass more specific words such as "animal" or "instrument". We introduce density order embeddings, which learn hierarchical representations through encapsulation of probability densities. In particular, we propose simple yet effective loss functions and distance metrics, as well as graph-based schemes to select negative samples to better learn hierarchical density representations. Our approach provides state-of-the-art performance on the WordNet hypernym relationship prediction task and the challenging HyperLex lexical entailment dataset -- while retaining a rich and interpretable density representation.
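One way to make the encapsulation idea concrete is an asymmetric divergence between densities: if a specific word's narrow distribution sits inside a general word's broad one, the divergence from specific to general is small. The sketch below uses the KL divergence between diagonal Gaussians as such an asymmetric penalty; this is an illustrative stand-in, not necessarily the exact distance metric used in the paper.

```python
import numpy as np

def kl_diag_gauss(mu1, var1, mu2, var2):
    # KL(N1 || N2) for diagonal Gaussians. Asymmetric: it is small
    # when N1 is (approximately) encapsulated by N2, so it can serve
    # as an order-violation penalty d(specific, general) inside a
    # hinge/max-margin loss over positive and negative pairs.
    return 0.5 * np.sum(var1 / var2 + (mu2 - mu1) ** 2 / var2
                        - 1.0 + np.log(var2 / var1))
```

A narrow "animal" Gaussian inside a broad "entity" Gaussian yields a small penalty in that direction, while the reverse direction is penalized heavily, which is what a hierarchical ordering needs.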
Multimodal Word Distributions
Athiwaratkun, Ben, Wilson, Andrew Gordon
Word embeddings provide point representations of words containing useful semantic information. We introduce multimodal word distributions formed from Gaussian mixtures, for multiple word meanings, entailment, and rich uncertainty information. To learn these distributions, we propose an energy-based max-margin objective. We show that the resulting approach captures uniquely expressive semantic information, and outperforms alternatives such as word2vec skip-grams and Gaussian embeddings on benchmark datasets such as word similarity and entailment.
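The energy-based max-margin objective can be sketched with a standard closed form: the expected likelihood kernel between two diagonal Gaussians, which is a common similarity "energy" for Gaussian embeddings. The loss shape below (hinge over positive and negative pairs) follows the abstract; the specific margin value and the restriction to single Gaussian components (rather than full mixtures) are simplifying assumptions for illustration.

```python
import numpy as np

def log_elk(mu1, var1, mu2, var2):
    # Log expected-likelihood kernel of two diagonal Gaussians:
    # log ∫ N(x; mu1, var1) N(x; mu2, var2) dx
    #   = log N(mu1; mu2, var1 + var2).
    v = var1 + var2
    return -0.5 * np.sum(np.log(2.0 * np.pi * v)
                         + (mu1 - mu2) ** 2 / v)

def max_margin_loss(energy_pos, energy_neg, margin=1.0):
    # Push the energy of a true (word, context) pair above that of a
    # sampled negative pair by at least `margin`.
    return max(0.0, margin - energy_pos + energy_neg)
```

For a full mixture, the energy would aggregate the kernel over all pairs of components; minimizing the hinge loss then pulls co-occurring words' mixture components together while pushing negatives apart.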