Mathematical & Statistical Methods


88c3c482430a62d35e03926a22e4b67e-Supplemental-Conference.pdf

Neural Information Processing Systems

CoLA and discuss modifications to improve lower-precision performance. In Appendix D we expand on the details of the experiments in the main text. We now present the linear algebra identities that we use to exploit structure in CoLA. Finally, for sums we have the Woodbury identity and its variants. Besides the compositional operators, we have rules for some special operators.
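For reference, the Woodbury identity mentioned above, in one standard form (the exact variant applied to each structured sum in CoLA may differ):

$(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1},$

which lets a solve with a low-rank-plus-structured sum reuse a fast solve with $A$ alone.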


Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra

Neural Information Processing Systems

Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra).
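As an illustration of the kind of dispatch rule such a framework can exploit (a numpy sketch built on the standard identity $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^{\top})$, not CoLA's own API), a solve with a Kronecker-structured operator reduces to two small solves:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 30, 40
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # keep both factors well conditioned
    B = rng.standard_normal((m, m)) + m * np.eye(m)
    Y = rng.standard_normal((m, n))                   # right-hand side, viewed as a matrix

    # Dense baseline: materialize the (n*m) x (n*m) Kronecker product.
    y = Y.reshape(-1, order="F")                      # vec(Y), column-major stacking
    x_dense = np.linalg.solve(np.kron(A, B), y)

    # Structured solve: (A kron B)^{-1} vec(Y) = vec(B^{-1} Y A^{-T}),
    # costing O(n^3 + m^3) instead of O((n*m)^3).
    M = np.linalg.solve(B, Y)                         # B^{-1} Y
    X = np.linalg.solve(A, M.T).T                     # (B^{-1} Y) A^{-T}
    x_struct = X.reshape(-1, order="F")

    print(np.allclose(x_dense, x_struct))             # True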


On Contrastive Representations of Stochastic Processes Emile Mathieu, Adam Foster

Neural Information Processing Systems

Learning representations of stochastic processes is an emerging problem in machine learning with applications from meta-learning to physical object models to time series. Typical methods rely on exact reconstruction of observations, but this approach breaks down as observations become high-dimensional or noise distributions become complex.
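As a rough sketch of the reconstruction-free alternative the title points to, here is a generic InfoNCE-style contrastive objective in numpy; the encoder, the way views of a process are formed, and the loss details in the paper itself may differ.

    import numpy as np

    def info_nce(z_a, z_b, temperature=0.1):
        """Contrastive loss: row i of z_a should be most similar to row i of z_b."""
        z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
        z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
        logits = z_a @ z_b.T / temperature                       # pairwise similarities
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))                       # positives on the diagonal

    # Toy usage: embeddings of two observation sets drawn from the same processes.
    rng = np.random.default_rng(0)
    z_a = rng.standard_normal((8, 16))
    z_b = z_a + 0.1 * rng.standard_normal((8, 16))               # positive pairs: perturbed views
    print(info_nce(z_a, z_b))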


A Continuized View on Nesterov Acceleration for Stochastic Gradient Descent and Randomized Gossip Mathieu Even, Francis Bach

Neural Information Processing Systems

We introduce the "continuized" Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters; and a discretization of the continuized process can be computed exactly, with convergence rates similar to those of Nesterov's original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters. We provide continuized Nesterov acceleration under deterministic as well as stochastic gradients, with either additive or multiplicative noise. Finally, using our continuized framework and expressing the gossip averaging problem as the stochastic minimization of a certain energy function, we provide the first rigorous acceleration of asynchronous gossip algorithms.
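For orientation, a minimal numpy sketch of the deterministic baseline that the continuized variant randomizes: classical Nesterov acceleration on a strongly convex quadratic with the textbook constant parameters. In the continuized version described above, gradient steps instead occur at the jump times of a Poisson process and the mixing coefficients become functions of the random inter-arrival times.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50
    Q = rng.standard_normal((d, d))
    H = Q.T @ Q + np.eye(d)                     # f(x) = 0.5 * x^T H x, strongly convex
    grad = lambda x: H @ x
    L, mu = np.linalg.eigvalsh(H).max(), np.linalg.eigvalsh(H).min()

    kappa = L / mu
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient
    x = y = rng.standard_normal(d)
    for _ in range(300):
        x_next = y - grad(y) / L                # gradient step at the extrapolated point
        y = x_next + beta * (x_next - x)        # mixing (momentum) step
        x = x_next
    print(np.linalg.norm(x))                    # distance to the minimizer x* = 0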


Private Low-Rank Approximation for Covariance Matrices, Dyson Brownian Motion, and Eigenvalue-Gap Bounds for Gaussian Perturbations

arXiv.org Artificial Intelligence

We consider the problem of approximating a $d \times d$ covariance matrix $M$ with a rank-$k$ matrix under $(\varepsilon,\delta)$-differential privacy. We present and analyze a complex variant of the Gaussian mechanism and obtain upper bounds on the Frobenius norm of the difference between the matrix output by this mechanism and the best rank-$k$ approximation to $M$. Our analysis provides improvements over previous bounds, particularly when the spectrum of $M$ satisfies natural structural assumptions. The novel insight is to view the addition of Gaussian noise to a matrix as a continuous-time matrix Brownian motion. This viewpoint allows us to track the evolution of eigenvalues and eigenvectors of the matrix, which are governed by stochastic differential equations discovered by Dyson. These equations enable us to upper bound the Frobenius distance between the best rank-$k$ approximation of $M$ and that of a Gaussian perturbation of $M$ as an integral that involves inverse eigenvalue gaps of the stochastically evolving matrix, as opposed to a sum of perturbation bounds obtained via Davis-Kahan-type theorems. Subsequently, again using the Dyson Brownian motion viewpoint, we show that the eigenvalues of the matrix $M$ perturbed by Gaussian noise have large gaps with high probability. These results also contribute to the analysis of low-rank approximations under average-case perturbations, and to an understanding of eigenvalue gaps for random matrices, both of which may be of independent interest.
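A simplified numpy sketch of the basic mechanism being analyzed: add symmetric Gaussian noise to the covariance matrix and take the top-k eigenspace of the perturbed matrix. The paper studies a complex-valued variant with sharper bounds; the noise calibration and unit sensitivity below are classical real-valued placeholders, not the paper's.

    import numpy as np

    def private_rank_k(M, k, eps, delta, sensitivity=1.0, rng=None):
        # Classical Gaussian-mechanism calibration (valid for eps <= 1); assumed, not the paper's.
        rng = rng or np.random.default_rng()
        d = M.shape[0]
        sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
        noise = rng.normal(scale=sigma, size=(d, d))
        noisy = M + (noise + noise.T) / np.sqrt(2.0)   # symmetric Gaussian perturbation
        vals, vecs = np.linalg.eigh(noisy)             # eigenvalues in ascending order
        top = vecs[:, -k:]                             # top-k eigenvectors
        return top @ np.diag(vals[-k:]) @ top.T        # rank-k approximation of the noisy matrix

    # Toy usage on an empirical covariance matrix.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 20))
    M = X.T @ X / 200
    M_k = private_rank_k(M, k=3, eps=1.0, delta=1e-5, rng=rng)
    print(np.linalg.norm(M - M_k))                     # Frobenius error of the private approximation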


Fast and Safe Scheduling of Robots

arXiv.org Artificial Intelligence

In this paper, we present an experimental analysis of a fast heuristic algorithm that was designed to generate a fast, collision-free schedule for a set of robots on a path graph. The experiments confirm the algorithm's effectiveness in producing collision-free schedules as well as achieving the optimal solution when all tasks assigned to the robots are of equal duration. Additionally, we provide an integer linear programming formulation that guarantees an optimal solution for this scheduling problem on any input graph, at the expense of significantly greater computational resources. We prove the correctness of our integer linear program. By comparing the solutions of the two algorithms, both the duration of the resulting schedules and the run time of each algorithm, we show that the heuristic algorithm is optimal or near-optimal in nearly all cases, with a far faster run time than the integer linear program.


Bandit Optimal Transport

arXiv.org Machine Learning

Despite the impressive progress in statistical Optimal Transport (OT) in recent years, there has been little interest in the study of the sequential learning of OT. Surprisingly so, as this problem is both practically motivated and a challenging extension of existing settings such as linear bandits. This article considers (for the first time) the stochastic bandit problem of learning to solve generic Kantorovich and entropic OT problems from repeated interactions when the marginals are known but the cost is unknown. We provide $\tilde{\mathcal O}(\sqrt{T})$ regret algorithms for both problems by extending linear bandits on Hilbert spaces. These results provide a reduction to infinite-dimensional linear bandits. To deal with the dimension, we provide a method to exploit the intrinsic regularity of the cost to learn, yielding corresponding regret bounds which interpolate between $\tilde{\mathcal O}(\sqrt{T})$ and $\tilde{\mathcal O}(T)$.
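A toy numpy sketch of the setting (not the paper's algorithm): the marginals are known, the cost matrix is unknown and only observed through noise, and each round we play the entropic OT plan that is optimal for the current cost estimate. The entrywise noisy-feedback model and plain averaging below are simplifications for illustration.

    import numpy as np

    def sinkhorn(cost, a, b, reg=0.05, n_iters=200):
        """Entropic OT plan between marginals a and b for a given cost matrix."""
        K = np.exp(-cost / reg)
        u = np.ones_like(a)
        for _ in range(n_iters):
            v = b / (K.T @ u)
            u = a / (K @ v)
        return u[:, None] * K * v[None, :]

    rng = np.random.default_rng(0)
    n = 5
    true_cost = rng.random((n, n))                    # unknown to the learner
    a = b = np.full(n, 1.0 / n)                       # known marginals
    cost_sum = np.zeros((n, n))
    for t in range(500):
        est = cost_sum / t if t > 0 else np.zeros((n, n))
        plan = sinkhorn(est, a, b)                    # play the plan optimal for the estimate
        cost_sum += true_cost + 0.1 * rng.standard_normal((n, n))   # noisy cost feedback
    best = sinkhorn(true_cost, a, b)
    print(np.sum((plan - best) * true_cost))          # gap to the plan for the true cost (near zero)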


Supplementary Material for "Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems"

Neural Information Processing Systems

A.1 Auxiliary Lemmas. Throughout the proof, we use F. We first present some results that will be used frequently in the proof. L 2η (52), where (a) uses (18a) in Lemma 3. Recall that the lower-level function for the min-max problem is g(x, y; φ) = f(x, y; ξ). B.2 Reduction from Theorem 1 to Proposition 3. In the min-max case, we apply Theorem 1 with η = 1. These assumptions are mostly common in analyzing actor-critic methods with linear value function approximation [50-52]. Assumption 9 is common in analyzing TD with linear function approximation; see, e.g., [54, 55, 50].


Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems

Neural Information Processing Systems

Stochastic nested optimization, including stochastic bilevel, min-max, and compositional optimization, is gaining popularity in many machine learning applications. While the three problems share a nested structure, existing works often treat them separately, thus developing problem-specific algorithms and analyses. Among various exciting developments, simple SGD-type updates (potentially on multiple variables) are still prevalent in solving this class of nested problems, but they are believed to converge more slowly than in the non-nested setting.
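A minimal numpy sketch of the alternating update pattern the paper analyzes: one stochastic step on the lower-level variable, then one on the upper-level variable using an approximate hypergradient. The toy quadratic problem below is chosen so that the implicit derivative dy*/dx is the identity; it illustrates the update structure, not the paper's algorithm or step sizes.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10
    c = rng.standard_normal(d)                  # upper-level target

    # Lower level: g(x, y) = 0.5 * ||y - x||^2   ->  y*(x) = x
    # Upper level: f(x, y) = 0.5 * ||y - c||^2   ->  F(x) = f(x, y*(x)) is minimized at x = c
    # Here dy*/dx = I, so grad_y f at the current inner iterate approximates the hypergradient.
    alpha, beta, noise = 0.1, 0.5, 0.01
    x, y = np.zeros(d), np.zeros(d)
    for _ in range(2000):
        y = y - beta * ((y - x) + noise * rng.standard_normal(d))   # lower-level SGD step
        x = x - alpha * ((y - c) + noise * rng.standard_normal(d))  # upper-level SGD step
    print(np.linalg.norm(x - c))                # should be close to zero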


Epidemic Learning: Boosting Decentralized Learning with Randomized Communication

Neural Information Processing Systems

We present Epidemic Learning (EL), a simple yet powerful decentralized learning (DL) algorithm that leverages changing communication topologies to achieve faster model convergence compared to conventional DL approaches. At each round of EL, each node sends its model updates to a random sample of s other nodes (in a system of n nodes). We provide an extensive theoretical analysis of EL, demonstrating that its changing topology culminates in superior convergence properties compared to the state-of-the-art (static and dynamic) topologies. Considering smooth nonconvex loss functions, the number of transient iterations for EL, i.e., the rounds required to achieve asymptotic linear speedup, is in O(
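A simplified numpy sketch of the per-round communication pattern described above: each node takes a local step, sends its model to s randomly sampled peers, and averages whatever it holds at the end of the round. The exact update rule and averaging weights in EL may differ.

    import numpy as np

    def epidemic_round(models, grads, s, lr=0.1, rng=None):
        """One round: local step, push to s random peers, average received models."""
        rng = rng or np.random.default_rng()
        n = len(models)
        updated = [m - lr * g for m, g in zip(models, grads)]
        inbox = [[u] for u in updated]                            # every node keeps its own model
        for i in range(n):
            peers = rng.choice([j for j in range(n) if j != i], size=s, replace=False)
            for j in peers:
                inbox[j].append(updated[i])
        return [np.mean(msgs, axis=0) for msgs in inbox]

    # Toy usage: 8 nodes, 5-dimensional models, random local "gradients".
    rng = np.random.default_rng(0)
    models = [rng.standard_normal(5) for _ in range(8)]
    grads = [rng.standard_normal(5) for _ in range(8)]
    models = epidemic_round(models, grads, s=3, rng=rng)
    print(np.round(models[0], 3))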