Well File:

 Michael I. Jordan


Gradient Descent Can Take Exponential Time to Escape Saddle Points

Neural Information Processing Systems

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points--it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.


Kernel Feature Selection via Conditional Covariance Minimization

Neural Information Processing Systems

We propose a method for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we show how to perform feature selection via a constrained optimization problem involving the trace of the conditional covariance operator. We prove various consistency results for this procedure, and also demonstrate that our method compares favorably with other state-of-the-art algorithms on a variety of synthetic and real data sets.


Fast Black-box Variational Inference through Stochastic Trust-Region Optimization

Neural Information Processing Systems

We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the "reparameterization trick." At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution.





Theoretical guarantees for EM under misspecified Gaussian mixture models

Neural Information Processing Systems

Recent years have witnessed substantial progress in understanding the behavior of EM for mixture models that are correctly specified. Given that model misspecification is common in practice, it is important to understand EM in this more general setting. We provide non-asymptotic guarantees for the population and sample-based EM algorithms when used to estimate parameters of certain misspecified Gaussian mixture models.


Information Constraints on Auto-Encoding Variational Bayes

Neural Information Processing Systems

Parameterizing the approximate posterior of a generative model with neural networks has become a common theme in recent machine learning research. While providing appealing flexibility, this approach makes it difficult to impose or assess structural constraints such as conditional independence. We propose a framework for learning representations that relies on auto-encoding variational Bayes, in which the search space is constrained via kernel-based measures of independence. In particular, our method employs the d-variable Hilbert-Schmidt Independence Criterion (dHSIC) to enforce independence between the latent representations and arbitrary nuisance factors. We show how this method can be applied to a range of problems, including problems that involve learning invariant and conditionally independent representations. We also present a full-fledged application to singlecell RNA sequencing (scRNA-seq). In this setting the biological signal is mixed in complex ways with sequencing errors and sampling effects. We show that our method outperforms the state-of-the-art approach in this domain.



Is Q-Learning Provably Efficient?

Neural Information Processing Systems

Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [7, 22]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions.