Goto

Collaborating Authors

 Goyal, Navin


On the Computational Power of Transformers and its Implications in Sequence Modeling

arXiv.org Machine Learning

Transformers are being used extensively across several sequence modeling tasks. Significant research effort has been devoted to experimentally probe the inner workings of Transformers. However, our conceptual and theoretical understanding of their power and inherent limitations is still nascent. In particular, the roles of various components in Transformers such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear. In this paper, we take a step towards answering these questions. We analyze the computational power as captured by Turing-completeness. We first provide an alternate and simpler proof to show that vanilla Transformers are Turing-complete and then we prove that Transformers with only positional masking and without any positional encoding are also Turing-complete. We further analyze the necessity of each component for the Turing-completeness of the network; interestingly, we find that a particular type of residual connection is necessary. We demonstrate the practical implications of our results via experiments on machine translation and synthetic tasks.


Robust Identifiability in Linear Structural Equation Models of Causal Inference

arXiv.org Artificial Intelligence

In this work, we consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). LSEMs are a popular and well-studied class of models for inferring causality in the natural and social sciences. One of the main problems related to LSEMs is to recover the model parameters from the observational data. Under various conditions on LSEMs and the model parameters the prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called ``bow-free paths.'' In this work, we significantly extend their work along multiple dimensions. First, for a large and well-studied class of LSEMs, namely ``bow free'' models, we provide a sufficient condition on model parameters under which robust identifiability holds, thereby removing the restriction of paths required by prior work. We then show that this sufficient condition holds with high probability which implies that for a large set of parameters robust identifiability holds and that for such parameters, existing algorithms already achieve robust identifiability. Finally, we validate our results on both simulated and real-world datasets.


Non-Gaussianity of Stochastic Gradient Noise

arXiv.org Machine Learning

What enables Stochastic Gradient Descent (SGD) to achieve better generalization than Gradient Descent (GD) in Neural Network training? This question has attracted much attention. In this paper, we study the distribution of the Stochastic Gradient Noise (SGN) vectors during the training. We observe that for batch sizes 256 and above, the distribution is best described as Gaussian at-least in the early phases of training. This holds across data-sets, architectures, and other choices.


Effect of Activation Functions on the Training of Overparametrized Neural Nets

arXiv.org Machine Learning

It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. The limiting case when the network size approaches infinity has also been considered. These results either assume that the activation function is ReLU or they crucially depend on the minimum eigenvalue of a certain Gram matrix depending on the data, random initialization and the activation function. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds. On the empirical side, a contemporary line of investigations has proposed a number of alternative activation functions which tend to perform better than ReLU at least in some settings but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of activation function on the training of highly overparametrized 2-layer neural networks. We show that for smooth activations, such as tanh and swish, the minimum eigenvalue can be exponentially small depending on the span of the dataset implying that the training can be very slow. In contrast, for activations with a "kink," such as ReLU, SELU, ELU, all eigenvalues are large under minimal assumptions on the data. Several new ideas are involved. Finally, we corroborate our results empirically.


Stability of Linear Structural Equation Models of Causal Inference

arXiv.org Machine Learning

We consider the numerical stability of the parameter recovery problem in Linear Structural Equation Model ($\LSEM$) of causal inference. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of $\LSEM$ allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of this paper is complementary to this line of work; we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl's notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the $\LSEM$. First we prove that under a sufficient condition, for a certain sub-class of $\LSEM$ that are \emph{bow-free} (Brito and Pearl (2002)), the parameter recovery is stable. We further prove that \emph{randomly} chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of parameter space, recovery is numerically stable. Next we construct an example of $\LSEM$ on four vertices with \emph{unbounded} condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any $\LSEM$ instance.


Non-Gaussian Component Analysis using Entropy Methods

arXiv.org Machine Learning

Non-Gaussian component analysis (NGCA) is a problem in multidimensional data analysis. Since its formulation in 2006, NGCA has attracted considerable attention in statistics and machine learning. In this problem, we have a random variable $X$ in $n$-dimensional Euclidean space. There is an unknown subspace $U$ of the $n$-dimensional Euclidean space such that the orthogonal projection of $X$ onto $U$ is standard multidimensional Gaussian and the orthogonal projection of $X$ onto $V$, the orthogonal complement of $U$, is non-Gaussian, in the sense that all its one-dimensional marginals are different from the Gaussian in a certain metric defined in terms of moments. The NGCA problem is to approximate the non-Gaussian subspace $V$ given samples of $X$. Vectors in $V$ corresponds to "interesting" directions, whereas vectors in $U$ correspond to the directions where data is very noisy. The most interesting applications of the NGCA model is for the case when the magnitude of the noise is comparable to that of the true signal, a setting in which traditional noise reduction techniques such as PCA don't apply directly. NGCA is also related to dimensionality reduction and to other data analysis problems such as ICA. NGCA-like problems have been studied in statistics for a long time using techniques such as projection pursuit. We give an algorithm that takes polynomial time in the dimension $n$ and has an inverse polynomial dependence on the error parameter measuring the angle distance between the non-Gaussian subspace and the subspace output by the algorithm. Our algorithm is based on relative entropy as the contrast function and fits under the projection pursuit framework. The techniques we develop for analyzing our algorithm maybe of use for other related problems.


Heavy-Tailed Analogues of the Covariance Matrix for ICA

AAAI Conferences

Independent Component Analysis (ICA) is the problem of learning a square matrix A, given samples of X = A S, where S is a random vector with independent coordinates. Most existing algorithms are provably efficient only when each S i has finite and moderately valued fourth moment. However, there are practical applications where this assumption need not be true, such as speech and finance. Algorithms have been proposed for heavy-tailed ICA, but they are not practical, using random walks and the full power of the ellipsoid algorithm multiple times. The main contributions of this paper are (1) A practical algorithm for heavy-tailed ICA that we call HTICA. We provide theoretical guarantees and show that it outperforms other algorithms in some heavy-tailed regimes, both on real and synthetic data. Like the current state-of-the-art, the new algorithm is based on the centroid body (a first moment analogue of the covariance matrix). Unlike the state-of-the-art, our algorithm is practically efficient. To achieve this, we use explicit analytic representations of the centroid body, which bypasses the use of the ellipsoid method and random walks. (2) We study how heavy tails affect different ICA algorithms, including HTICA. Somewhat surprisingly, we show that some algorithms that use the covariance matrix or higher moments can successfully solve a range of ICA instances with infinite second moment. We study this theoretically and experimentally, with both synthetic and real-world heavy-tailed data.


Heavy-tailed Independent Component Analysis

arXiv.org Machine Learning

Independent component analysis (ICA) is the problem of efficiently recovering a matrix $A \in \mathbb{R}^{n\times n}$ from i.i.d. observations of $X=AS$ where $S \in \mathbb{R}^n$ is a random vector with mutually independent coordinates. This problem has been intensively studied, but all existing efficient algorithms with provable guarantees require that the coordinates $S_i$ have finite fourth moments. We consider the heavy-tailed ICA problem where we do not make this assumption, about the second moment. This problem also has received considerable attention in the applied literature. In the present work, we first give a provably efficient algorithm that works under the assumption that for constant $\gamma > 0$, each $S_i$ has finite $(1+\gamma)$-moment, thus substantially weakening the moment requirement condition for the ICA problem to be solvable. We then give an algorithm that works under the assumption that matrix $A$ has orthogonal columns but requires no moment assumptions. Our techniques draw ideas from convex geometry and exploit standard properties of the multivariate spherical Gaussian distribution in a novel way.


The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures

arXiv.org Machine Learning

In this paper we show that very large mixtures of Gaussians are efficiently learnable in high dimension. More precisely, we prove that a mixture with known identical covariance matrices whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can possibly exist in low dimension and the problem of learning the parameters is generically hard. In contrast, much of the existing work on Gaussian Mixtures relies on low-dimensional projections and thus hits an artificial barrier. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussians to a linear map of a product distribution. The problem of learning this map can be efficiently solved using some recent results on tensor decompositions and Independent Component Analysis (ICA), thus giving an algorithm for recovering the mixture. In addition, we combine our low-dimensional hardness results for Gaussian mixtures with Poissonization to show how to embed difficult instances of low-dimensional Gaussian mixtures into the ICA setting, thus establishing exponential information-theoretic lower bounds for underdetermined ICA in low dimension. To the best of our knowledge, this is the first such result in the literature. In addition to contributing to the problem of Gaussian mixture learning, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.


Thompson Sampling for Contextual Bandits with Linear Payoffs

arXiv.org Machine Learning

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a generalization of Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We provide the first theoretical guarantees for the contextual version of Thompson Sampling. We prove a high probability regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (or $\tilde{O}(d\sqrt{T \log(N)})$), which is the best regret bound achieved by any computationally efficient algorithm available for this problem in the current literature, and is within a factor of $\sqrt{d}$ (or $\sqrt{\log(N)}$) of the information-theoretic lower bound for this problem.