Goto

Collaborating Authors

 Mathematical & Statistical Methods


Review for NeurIPS paper: SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence

Neural Information Processing Systems

The reviewers had some concerns about the empirical evaluation and the lack of discrete-time results, but agree that this paper would be a useful addition to NeurIPS. Please see the reviews (and your response) for ways to improve the final manuscript (especially in terms of clarifying the context and scope of this work).


Reviews: Learning Erdos-Renyi Random Graphs via Edge Detecting Queries

Neural Information Processing Systems

This paper studies a learning task of Erdos-Renyi graph with edge detecting queries. Each query or test is given as a set of input (a set of nodes) X representing a set of nodes and output Y(X) 0 or 1 representing the absence or presence of at lease one edge in the node set X. The authors derive the theoretical and algorithmic bounds of the number of tests required for achieving the perfect reconstruction of the graph in the asymptotic limit where the number of nodes n tends to be infinity, using three known algorithms: COMP, DD, and SSS. The derived analytic form is tight up to the asymptotic constant. A new algorithm is also invented based on GROTESQUE algorithm, achieving a sublinear computational time in the decoding step which is near optimal compared to the derived bounds.


Reviews: Learning Erdos-Renyi Random Graphs via Edge Detecting Queries

Neural Information Processing Systems

The main contributions are threefold: - An algorithm-independent probabilistic lower bound of the number of non-adaptive tests is given via Fano's inequality (Theorem 1). Under the Bernoulli testing, upper bounds are provided for COMP (Theorem 2) and DD (Theorem 3), and a lower bound is provided for SSS (Theorem 4). These in particular show that DD is asymptotically optimal under the Bernoulli testing when \theta 1/2. Although the three review scores exhibited a relatively large split in the initial round of review, after the authors' rebuttal as well as discussion among the reviewers, all the three reviewers have become positive ratings. I would thus like to recommend acceptance of this paper.


Quantitative Error Bounds for Scaling Limits of Stochastic Iterative Algorithms

arXiv.org Machine Learning

Stochastic iterative algorithms, including stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD), are widely utilized for optimization and sampling in large-scale and high-dimensional problems in machine learning, statistics, and engineering. Numerous works have bounded the parameter error in, and characterized the uncertainty of, these approximations. One common approach has been to use scaling limit analyses to relate the distribution of algorithm sample paths to a continuous-time stochastic process approximation, particularly in asymptotic setups. Focusing on the univariate setting, in this paper, we build on previous work to derive non-asymptotic functional approximation error bounds between the algorithm sample paths and the Ornstein-Uhlenbeck approximation using an infinite-dimensional version of Stein's method of exchangeable pairs. We show that this bound implies weak convergence under modest additional assumptions and leads to a bound on the error of the variance of the iterate averages of the algorithm. Furthermore, we use our main result to construct error bounds in terms of two common metrics: the L\'{e}vy-Prokhorov and bounded Wasserstein distances. Our results provide a foundation for developing similar error bounds for the multivariate setting and for more sophisticated stochastic approximation algorithms.


Reviews: Learning from Rational Behavior: Predicting Solutions to Unknown Linear Programs

Neural Information Processing Systems

The paper provides learning algorithm for predicting the solution of linear program with partial observation and provides some mistake bounds for these algorithms. Two different problems are considered: 1) The objective function of LP is known but the constraint sets are not known. At each time a new constraint is revealed and the goal is to predict the optimal solution of the LP including all known and unknown constraints. In the first problem, the mistake bounds are obtained for two cases: when the new constraints are selected adversarially, in which case the authors show that given a uniform bound on the precision of the constraints and some other assumptions such as non-degeneracy of the constraints, there exists an algorithm with mistake bound and running time linear in terms of the number of edges of the underlying unknown polytope and the number of precision bits. It is shown that uniform upper bound on the precision bits is essential for dimensions higher than 2. To relax this issue, the authors consider the known objective LP problem in stochastic scenario where the revealed constraints at each time are independently chosen from an underlying unknown distribution, in which case a learning algorithm by expanding convex hulls is provided which attains in expectation at most linear mistakes in terms of the number of edges and grows only logarithmic in time.


Reviews: Without-Replacement Sampling for Stochastic Gradient Methods

Neural Information Processing Systems

The paper studies the problem of minimizing the average of a finite sum of convex functions over a convex domain using stochastic algorithms that, opposed to most popular methods, apply WITHOUT-replacement sampling to the data. More specifically, the paper considers methods that first randomly permute the functions, and then process the functions via incremental updates one at a time, making at most a single pass over the data (hence the data is only shuffled once). There are three main motivations for considering stochastic methods with without-replacement sampling: 1. it is observed many times empirically (and to the best of my knowledge this is what ML practitioners really do) that applying random shuffles to the data and then apply incremental updates works better than with-replacement SGD. 2. After the data is randomly permuted, these algorithms require only sequential access to the memory, which is much more efficient than standard with-replacement SGD methods that require random access to perform the updates. Since the main setting under consideration here is when making at most a single pass, it sounds plausible to assume that the data is already stored in some random permutation, and hence the algorithm is fully sequential, and there is no need to even artificially permute the data. In a distributed setting in which data is partitioned to several machine (not assuming the data is sampled i.i.d from a distribution), it is more natural and efficient to analyze without-replacement algorithms.


Reviews: Interpretable Nonlinear Dynamic Modeling of Neural Trajectories

Neural Information Processing Systems

Overall I found the paper to be solid and rather enjoyable, and I would qualify it as a strong candidate for a poster. The authors' method of plotting velocity fields by decomposing the velocity into direction and speed, which they've apparently introduced, is especially effective. It made their arguments and conclusions much easier to follow, and will hopefully be picked up by others. In my opinion stating that this approach leads to "interpretable models" might be somewhat overselling the results – the interpretability of the results is still hampered by the fact that models are composed by 10-100 more or less arbitrary basis functions. That being said, their capacity to reproduce salient features of the phase diagram certainly makes them more interpretable than, say, recurrent neural networks.


Reviews: Adaptive Newton Method for Empirical Risk Minimization to Statistical Accuracy

Neural Information Processing Systems

The paper contains a novel and interesting idea, and is well written. I think it should be accepted. This is one of a very few papers which attempt to combine optimization and statistical considerations. Most papers suggesting algorithms for solving the ERM problem simply focus on solving the deterministic ERM problem, and ignore the fact that ERM objective is an approximation to the true loss which arises as an expectation of the loss over an unknown sample distribution; and hence is subject to an approximation error. This is of course known in the literature, but it is notoriously difficult to address both the optimization aspect and the approximation aspect in a meaningful way in a single work.


Reviews: Sub-sampled Newton Methods with Non-uniform Sampling

Neural Information Processing Systems

Pros: This paper is well written and clear. The authors do a good job analyzing their method from a theoretical standpoint. I like that this paper has good theory. I like the kinds of experiments the authors chose, and how they are presented. All in all I think this paper is good, and is a solid contribution to the literature on approximate Newton methods.


Reviews: Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Neural Information Processing Systems

The initial motivation seems to be the work of Hoffman et al on the use of clustering to speedup stochastic methods for ERM. Their method was not proved to converge to the optimal due to the use of biased stochastic gradients. Also, that work seemed to work only for small clusters due to the approach chosen. This papers goes a long way to develop the basic idea into a satisfying theoretical framework which also gives rise to efficient implementations. This paper is truly a pleasure to read – a very fine example of academic exposition.