Viswanath, Pramod
Communication Algorithms via Deep Learning
Kim, Hyeji, Jiang, Yihan, Rana, Ranvir, Kannan, Sreeram, Oh, Sewoong, Viswanath, Pramod
Coding theory is a central discipline underpinning wireline and wireless modems that are the workhorses of the information age. Progress in coding theory is largely driven by individual human ingenuity with sporadic breakthroughs over the past century. In this paper we study whether it is possible to automate the discovery of decoding algorithms via deep learning. We study a family of sequential codes parameterized by recurrent neural network (RNN) architectures. We show that creatively designed and trained RNN architectures can decode well-known sequential codes, such as convolutional and turbo codes, with close to optimal performance on the additive white Gaussian noise (AWGN) channel, performance that is itself achieved by breakthrough algorithms of our times (the Viterbi and BCJR decoders, representing dynamic programming and forward-backward algorithms). We show strong generalization: we train at a specific signal-to-noise ratio and block length but test over a wide range of these quantities, and we demonstrate robustness and adaptivity to deviations from the AWGN setting.
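To make the setup concrete, here is a minimal, hypothetical sketch of the pipeline the abstract describes: a rate-1/2 convolutional encoder (generators (7, 5) in octal), BPSK transmission over an AWGN channel, and a bidirectional GRU trained to recover the message bits. The architecture, hyperparameters, and SNR convention below are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
import torch
import torch.nn as nn

def conv_encode(bits):
    # Rate-1/2 convolutional code with generators (7, 5) in octal:
    # c1[t] = u[t] ^ u[t-1] ^ u[t-2],  c2[t] = u[t] ^ u[t-2]
    u = np.concatenate([np.zeros(2, dtype=int), bits])
    c1 = u[2:] ^ u[1:-1] ^ u[:-2]
    c2 = u[2:] ^ u[:-2]
    return np.stack([c1, c2], axis=-1)                        # (block_len, 2)

def awgn_batch(batch, block_len, snr_db):
    # Random messages -> coded BPSK symbols observed through an AWGN channel.
    msgs = np.random.randint(0, 2, (batch, block_len))
    coded = np.stack([conv_encode(m) for m in msgs])          # (batch, block_len, 2)
    symbols = 2.0 * coded - 1.0                               # BPSK: {0,1} -> {-1,+1}
    sigma = 10 ** (-snr_db / 20.0)                            # toy SNR convention (assumption)
    noisy = symbols + sigma * np.random.randn(*symbols.shape)
    return (torch.tensor(noisy, dtype=torch.float32),
            torch.tensor(msgs, dtype=torch.float32))

class RNNDecoder(nn.Module):
    # Bidirectional GRU mapping a noisy length-L codeword to L message-bit logits.
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, y):
        h, _ = self.rnn(y)
        return self.out(h).squeeze(-1)                        # (batch, block_len) logits

# Train at a single SNR and block length; the paper's observation is that such
# decoders then generalize across SNRs and block lengths at test time.
decoder = RNNDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(200):                                       # toy training budget
    noisy, msgs = awgn_batch(batch=128, block_len=100, snr_db=0.0)
    loss = loss_fn(decoder(noisy), msgs)
    opt.zero_grad()
    loss.backward()
    opt.step()
```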
All-but-the-Top: Simple and Effective Postprocessing for Word Representations
Mu, Jiaqi, Bhat, Suma, Viswanath, Pramod
Real-valued word representations have transformed NLP applications; popular examples are word2vec and GloVe, recognized for their ability to capture linguistic regularities. In this paper, we demonstrate a very simple, yet counter-intuitive, postprocessing technique -- eliminate the common mean vector and a few top dominating directions from the word vectors -- that renders off-the-shelf representations even stronger. The postprocessing is empirically validated on a variety of lexical-level intrinsic tasks (word similarity, concept categorization, word analogy) and sentence-level tasks (semantic textual similarity and text classification) on multiple datasets, with a variety of representation methods and hyperparameter choices, and in multiple languages; in each case, the processed representations are consistently better than the original ones.
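The postprocessing itself takes only a few lines; below is a minimal numpy sketch. The number of removed directions is a hyperparameter (small relative to the dimension, roughly d/100); the specific value here is illustrative.

```python
import numpy as np

def all_but_the_top(vectors, n_components):
    # Post-process word vectors: remove the common mean, then project out the
    # top principal directions, which tend to encode corpus-wide, non-semantic variation.
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    top_dirs = Vt[:n_components]                         # (n_components, d), orthonormal rows
    # Subtract each vector's projection onto those dominant directions.
    return centered - centered @ top_dirs.T @ top_dirs

# Example: 300-dimensional vectors, remove ~d/100 = 3 dominant directions (illustrative choice).
emb = np.random.randn(10000, 300)
processed = all_but_the_top(emb, n_components=3)
```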
Discovering Potential Correlations via Hypercontractivity
Kim, Hyeji, Gao, Weihao, Kannan, Sreeram, Oh, Sewoong, Viswanath, Pramod
Discovering a correlation from one variable to another is of fundamental scientific and practical interest. While existing correlation measures are suitable for discovering average correlation, they fail to discover hidden or potential correlations. To bridge this gap, (i) we postulate a set of natural axioms that we expect a measure of potential correlation to satisfy; (ii) we show that the rate of information bottleneck, i.e., the hypercontractivity coefficient, satisfies all the proposed axioms; (iii) we provide a novel estimator of the hypercontractivity coefficient from samples; and (iv) we provide numerical experiments demonstrating that this proposed estimator discovers potential correlations among various indicators of WHO datasets, is robust in discovering gene interactions from gene expression time series data, and is statistically more powerful than the estimators for other correlation measures in binary hypothesis testing of canonical examples of potential correlations.
Estimating Mutual Information for Discrete-Continuous Mixtures
Gao, Weihao, Kannan, Sreeram, Oh, Sewoong, Viswanath, Pramod
Estimation of mutual information from observed samples is a basic primitive in machine learning, useful in several learning tasks including correlation mining, information bottleneck, Chow-Liu trees, and conditional independence testing in (causal) graphical models. While mutual information is a quantity well-defined for general probability spaces, estimators have been developed only in the special cases of discrete or continuous pairs of random variables. Most of these estimators operate using the 3H principle, i.e., by calculating the three (differential) entropies of X, Y and the pair (X, Y). However, in general mixture spaces, such individual entropies are not well defined, even though mutual information is. In this paper, we develop a novel estimator of mutual information in discrete-continuous mixtures. We prove the consistency of this estimator theoretically and demonstrate its excellent empirical performance. This problem is relevant in a wide array of applications, where some variables are discrete, some continuous, and others are a mixture of continuous and discrete components.
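The abstract does not spell out the estimator, so the sketch below is only one plausible mixture-aware k-NN construction in the spirit it describes: when a sample's k-th nearest-neighbor distance is zero (a discrete atom), the tie count replaces k. It may differ in detail from the paper's estimator.

```python
import numpy as np
from scipy.special import digamma

def mixture_mi(x, y, k=5):
    # k-NN mutual-information estimate (nats) that tolerates discrete atoms
    # (zero nearest-neighbor distances). Hypothetical sketch, not the paper's exact form.
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)   # pairwise max-norm distances
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=-1)
    dxy = np.maximum(dx, dy)                                      # max-norm in the joint space
    total = 0.0
    for i in range(n):
        d = np.delete(dxy[i], i)
        rho = np.sort(d)[k - 1]                 # distance to the k-th neighbor in the joint space
        k_i = np.sum(d == 0.0) if rho == 0.0 else k   # discrete atom: count all ties at distance zero
        nx = np.sum(np.delete(dx[i], i) <= rho)
        ny = np.sum(np.delete(dy[i], i) <= rho)
        total += digamma(k_i) + np.log(n) - np.log(nx + 1) - np.log(ny + 1)
    return total / n

# Example: Y mixes a continuous part with a point mass, so 3H-style estimators break down here.
x = np.random.randn(500)
y = np.where(x > 0, 1.0, x)
print(mixture_mi(x, y, k=5))
```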
Geometry of Compositionality
Gong, Hongyu, Bhat, Suma, Viswanath, Pramod (University of Illinois at Urbana-Champaign)
This paper proposes a simple test for compositionality (i.e., literal usage) of a word or phrase in a context-specific way. The test is computationally simple, relies on no external resources, and uses only a set of trained word vectors. Experiments show that the proposed method is competitive with the state of the art and displays high accuracy in context-specific compositionality detection of a variety of natural language phenomena (idiomaticity, sarcasm, metaphor) for different datasets in multiple languages. The key insight is to connect compositionality to a curious geometric property of word embeddings, which is of independent interest.
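As a rough illustration of such a vector-only test, the sketch below scores how much of a phrase vector's energy lies in the principal subspace of its sentence-context word vectors; the rank, threshold, and function names are illustrative assumptions rather than the paper's exact statistic.

```python
import numpy as np

def compositionality_score(phrase_vec, context_vecs, rank=3):
    # Fraction of the phrase vector's norm captured by the top-`rank` principal
    # directions of the context word vectors; a high score suggests literal (compositional) usage.
    # Illustrative sketch; the paper's exact statistic may differ.
    C = np.asarray(context_vecs)                     # (num_context_words, d)
    _, _, Vt = np.linalg.svd(C - C.mean(axis=0), full_matrices=False)
    basis = Vt[:rank]                                # orthonormal rows spanning the context subspace
    proj = basis.T @ (basis @ phrase_vec)
    return np.linalg.norm(proj) / np.linalg.norm(phrase_vec)

# Usage: compare the score against a threshold tuned on held-out data (threshold is an assumption).
d = 300
score = compositionality_score(np.random.randn(d), np.random.randn(12, d), rank=3)
is_literal = score > 0.5
```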
Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation
Gao, Weihao, Oh, Sewoong, Viswanath, Pramod
Estimators of information theoretic measures such as entropy and mutual information from samples are a basic workhorse for many downstream applications in modern data science. State-of-the-art approaches have been either geometric (nearest neighbor (NN) based) or kernel based (with bandwidth chosen to be data independent and vanishing sublinearly in the sample size). In this paper we combine both these approaches to design new estimators of entropy and mutual information that strongly outperform all state-of-the-art methods. Our estimator uses a bandwidth choice of fixed $k$-NN distances; such a choice is both data dependent and linearly vanishing in the sample size, and it necessitates a bias cancellation term that is universal and independent of the underlying distribution. As a byproduct, we obtain a unified way of obtaining both kernel and NN estimators. The corresponding theoretical contribution relating the geometry of NN distances to asymptotic order statistics is of independent mathematical interest.
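For orientation, the classical fixed-$k$ nearest-neighbor (Kozachenko-Leonenko) entropy estimator that this line of work builds on fits in a few lines; the paper's estimator further replaces the hard $k$-NN ball with a local kernel at that bandwidth and subtracts a universal bias term, neither of which is shown in this sketch.

```python
import numpy as np
from scipy.special import digamma, gammaln

def kl_entropy(x, k=5):
    # Classical Kozachenko-Leonenko k-NN differential-entropy estimate (nats) with
    # Euclidean distances; shown as the baseline that local-bandwidth estimators refine.
    x = np.asarray(x, float).reshape(len(x), -1)
    n, d = x.shape
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    eps = np.sort(dists, axis=1)[:, k - 1]                  # distance to each point's k-th neighbor
    log_cd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of the unit d-ball
    return digamma(n) - digamma(k) + log_cd + d * np.mean(np.log(eps))

print(kl_entropy(np.random.randn(1000, 2), k=5))            # true entropy of N(0, I_2) is ~2.84 nats
```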
Geometry of Polysemy
Mu, Jiaqi, Bhat, Suma, Viswanath, Pramod
Vector representations of words have heralded a transformational approach to classical problems in NLP; the most popular example is word2vec. However, a single vector does not suffice to model the polysemous nature of many (frequent) words, i.e., words with multiple meanings. In this paper, we propose a three-fold approach for unsupervised polysemy modeling: (a) context representations, (b) sense induction and disambiguation, and (c) lexeme (as a word and sense pair) representations. A key feature of our work is the finding that a sentence containing a target word is well represented by a low-rank subspace, instead of a point in a vector space. We then show that the subspaces associated with a particular sense of the target word tend to intersect over a line (one-dimensional subspace), which we use to disambiguate senses using a clustering algorithm that harnesses the Grassmannian geometry of the representations. The disambiguation algorithm, which we call $K$-Grassmeans, leads to a procedure to label the different senses of the target word in the corpus -- yielding lexeme vector representations, all in an unsupervised manner starting from a large (Wikipedia) corpus in English. Apart from several prototypical target (word, sense) examples and a host of empirical studies to intuit and justify the various geometric representations, we validate our algorithms on standard sense induction and disambiguation datasets and present new state-of-the-art results.
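A heavily hedged sketch of the geometric pipeline: each occurrence's context is summarized by the top singular directions of its context word vectors, and occurrences are clustered by alternating between assigning each context subspace to its closest candidate sense direction and re-fitting that direction. The rank, number of senses, and update rule below are guesses at the K-Grassmeans recipe, not its exact specification.

```python
import numpy as np

def context_subspace(context_vecs, rank=3):
    # Orthonormal basis (rank x d) for the low-rank subspace summarizing one occurrence's context.
    _, _, Vt = np.linalg.svd(np.asarray(context_vecs), full_matrices=False)
    return Vt[:rank]

def k_grassmeans(subspaces, K=2, iters=20, seed=0):
    # Cluster context subspaces by how close each lies to a shared unit "sense direction".
    # Illustrative Grassmannian K-means-style procedure; the details are assumptions.
    rng = np.random.default_rng(seed)
    d = subspaces[0].shape[1]
    dirs = rng.standard_normal((K, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    labels = np.zeros(len(subspaces), dtype=int)
    for _ in range(iters):
        # Assignment: direction u is close to subspace with basis B when ||B u|| is large.
        scores = np.array([[np.linalg.norm(B @ u) for u in dirs] for B in subspaces])
        labels = scores.argmax(axis=1)
        # Update: each sense direction becomes the top eigenvector of the summed projectors.
        for k in range(K):
            members = [B for B, l in zip(subspaces, labels) if l == k]
            if not members:
                continue
            M = sum(B.T @ B for B in members)        # sum of projectors onto member subspaces
            _, vecs = np.linalg.eigh(M)
            dirs[k] = vecs[:, -1]
    return labels

# Toy usage with random "contexts" (in practice these are word vectors around each occurrence).
subs = [context_subspace(np.random.randn(10, 50)) for _ in range(40)]
print(k_grassmeans(subs, K=2))
```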
Breaking the Bandwidth Barrier: Geometrical Adaptive Entropy Estimation
Gao, Weihao, Oh, Sewoong, Viswanath, Pramod
Estimators of information theoretic measures such as entropy and mutual information are a basic workhorse for many downstream applications in modern data science. State of the art approaches have been either geometric (nearest neighbor (NN) based) or kernel based (with a globally chosen bandwidth). In this paper, we combine both these approaches to design new estimators of entropy and mutual information that outperform state of the art methods. Our estimator uses local bandwidth choices of $k$-NN distances with a finite $k$, independent of the sample size. Such a local and data dependent choice improves performance in practice, but the bandwidth is vanishing at a fast rate, leading to a non-vanishing bias. We show that the asymptotic bias of the proposed estimator is universal; it is independent of the underlying distribution. Hence, it can be pre-computed and subtracted from the estimate. As a byproduct, we obtain a unified way of obtaining both kernel and NN estimators. The corresponding theoretical contribution relating the asymptotic geometry of nearest neighbors to order statistics is of independent mathematical interest.
Demystifying Fixed k-Nearest Neighbor Information Estimators
Gao, Weihao, Oh, Sewoong, Viswanath, Pramod
Estimating mutual information from i.i.d. samples drawn from an unknown joint density function is a basic statistical problem of broad interest with multitudinous applications. The most popular estimator is one proposed by Kraskov, Stögbauer, and Grassberger (KSG) in 2004, and is nonparametric and based on the distances of each sample to its $k^{\rm th}$ nearest neighboring sample, where $k$ is a fixed small integer. Despite its widespread use (it is included in standard scientific software packages), the theoretical properties of this estimator have been largely unexplored. In this paper we demonstrate that the estimator is consistent and also identify an upper bound on the rate of convergence of the bias as a function of the number of samples. We argue that the superior performance benefits of the KSG estimator stem from a curious "correlation boosting" effect and build on this intuition to modify the KSG estimator in novel ways to construct a superior estimator. As a byproduct of our investigations, we obtain nearly tight rates of convergence of the $\ell_2$ error of the well-known fixed $k$-nearest neighbor estimator of differential entropy by Kozachenko and Leonenko.
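The fixed-$k$ estimator analyzed here has a compact closed form; a minimal numpy/scipy sketch of the standard KSG construction (first variant, max-norm distances) follows.

```python
import numpy as np
from scipy.special import digamma

def ksg_mi(x, y, k=5):
    # KSG (Kraskov-Stögbauer-Grassberger) mutual-information estimate in nats,
    # using max-norm distances and a fixed small k.
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=-1)
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=-1)
    dz = np.maximum(dx, dy)                          # max-norm in the joint (x, y) space
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]              # distance to each point's k-th joint neighbor
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    nx = np.sum(dx < eps[:, None], axis=1)           # marginal neighbors strictly inside the ball
    ny = np.sum(dy < eps[:, None], axis=1)
    return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

# Correlated Gaussian check: true MI is -0.5 * log(1 - rho^2) ~ 0.144 nats for rho = 0.5.
rho = 0.5
z = np.random.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=1000)
print(ksg_mi(z[:, 0], z[:, 1], k=5))
```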