Mathematical & Statistical Methods
Reviews: GIANT: Globally Improved Approximate Newton Method for Distributed Optimization
The paper introduces GIANT, a distributed variant of Newton algorithm. The considered problem is important and the paper gives a nice contribution to the field of distributed optimisation. The paper is very clear and nice to read, and propose nice theoretical contributions and experiments, with a detailed bibliography and positioning with respect to priori work. Here is my main criticism: * Authors acknowledge that their approach is close to previous works, namely DANE, for which GIANT seem to coincide to DANE in the least-squares loss case. However, the rate obtained in the paper is much better, certainly thanks to the introduction of the incoherence assumption, which is well known in the field of compressed sensing and randomized linear algebra.
Model Predictive Control is Almost Optimal for Restless Bandit
Gast, Nicolas, Narasimha, Dheeraj
We consider the discrete time infinite horizon average reward restless markovian bandit (RMAB) problem. We propose a \emph{model predictive control} based non-stationary policy with a rolling computational horizon $\tau$. At each time-slot, this policy solves a $\tau$ horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of $\tau$ and the number of arms, $N$. We show that its sub-optimality gap is $O(1/\sqrt{N})$ in general, and $\exp(-\Omega(N))$ under a local-stability condition. Our proof is based on a framework from dynamic control known as \emph{dissipativity}. Our solution easy to implement and performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus, be of great interest to the burgeoning RMAB community.
Reviews: Exploiting Numerical Sparsity for Efficient Learning : Faster Eigenvector Computation and Regression
The authors introduce numerical sparsity'' as a new parameter to measure the efficiency of algorithms for solving these problems. For a vector x, its numerical sparsity is defined to be x _1 2 / x _2 2. This quantity is a softer'' version of sparsity and is never larger than the l_0 sparsity of x. The main new insight of this paper is that, it is possible to achieve better running time if we measure the efficiency of algorithms using this new parameter. To accomplish this goal, instead of devising completely new convex optimization methods, the authors develop new stochastic gradient oracle for these problems by importance sampling the matrix A (Section 3), whose variance bound depends on the numerical sparsity of rows of A, and then directly apply existing stochastic first order methods to solve l_2 regression and top eigenvector computation (Section 4). The main new technical result in this paper is an unbiased stochastic oracle to estimate A T Ax for a matrix A and a vector x, which can serve as a stochastic gradient oracle when solving l_2 regression and top eigenvector computation. The variance of the oracle depends on the numerical sparsity of rows of A. I found this idea interesting and natural: consider a sparse vector a, if we slightly perturb a then its l_0 sparsity would be large. However, it is still possible to estimate such vector a accurately by (i) finding out heavy coordinates of a and (ii) importance sampling coordinates according to their magnitudes. The authors develop new stochastic oracle to estimate A T Ax by formalizing these intuitions in Section 3. The main difficulty here is to relate the variance to the function error, and the authors achieve this goal by exploiting the numerical sparsity of rows of A. I found this sampling procedure interesting and may have applications in other machine learning / numerical linear algebra tasks. Overall, this is a good paper and I would recommend acceptance.
Reviews: The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Review after rebuttal: I thank the author(s) for their response. While I still believe that this paper is a minor increment beyond what has already been done on SGLD, I agree that the message might be useful for some. I also appreciate the effort the authors have made in improving the manuscript based on reviews' suggestions, particularly their efforts to include relevant numerical experiments to ML scenarios, and recommendations beyond the CV approach which has been studied to exhaustion and rarely applicable in practice. Based on this, I've adjusted my decision to marginally above threshold. Original review: In the paper "The promises and pitfalls of Stochastic Gradient Langevin Dynamics" the authors revisit the Stochastic Langevin Gradient Dynamics (SGLD) approach to approximately sampling from a probability distribution using stochastic gradients (specifically subsampling). The authors compare a number of different classes of approximate inference method, including SGLD, LMC (known by some as Unadjusted Langevin Algorithm or ULA) and Stochastic Gradient Langevin Dynamics Fixed Point (SGLDFP) -- the latter being a variant of SGLD with a control variate exploiting the unimodality of the distribution, similar to what has been presented in [3, 25 and others].
Reviews: Total stochastic gradient algorithms and applications in reinforcement learning
This paper provides another formalism for gradient estimation in probabilistic computation graphs. Using pathwise derivative and likelihood ratio estimators, existing and well-known policy gradient theorems are cast into the proposed formalism. This intuition is then used to propose two new methods for gradient estimation that can be used in a model-based RL framework. Some results are shown that demonstrate comparable results to PILCO on the cart-pole task. Quality: the idea in this work is interesting, and the proposed framework and methods may prove useful in RL settings.
LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions
Kannan, Ravindran, Bhattacharyya, Chiranjib, Kacham, Praneeth, Woodruff, David P.
A central problem related to transformers can be stated as follows: given two $n \times d$ matrices $Q$ and $K$, and a non-negative function $f$, define the matrix $A$ as follows: (1) apply the function $f$ to each entry of the $n \times n$ matrix $Q K^T$, and then (2) normalize each of the row sums of $A$ to be equal to $1$. The matrix $A$ can be computed in $O(n^2 d)$ time assuming $f$ can be applied to a number in constant time, but the quadratic dependence on $n$ is prohibitive in applications where it corresponds to long context lengths. For a large class of functions $f$, we show how to find all the ``large attention scores", i.e., entries of $A$ which are at least a positive value $\varepsilon$, in time with linear dependence on $n$ (i.e., $n \cdot \textrm{poly}(d/\varepsilon)$) for a positive parameter $\varepsilon > 0$. Our class of functions include all functions $f$ of the form $f(x) = |x|^p$, as explored recently in transformer models. Using recently developed tools from randomized numerical linear algebra, we prove that for any $K$, there is a ``universal set" $U \subset [n]$ of size independent of $n$, such that for any $Q$ and any row $i$, the large attention scores $A_{i,j}$ in row $i$ of $A$ all have $j \in U$. We also find $U$ in $n \cdot \textrm{poly}(d/\varepsilon)$ time. Notably, we (1) make no assumptions on the data, (2) our workspace does not grow with $n$, and (3) our algorithms can be computed in streaming and parallel settings. We call the attention mechanism that uses only the subset of keys in the universal set as LevAttention since our algorithm to identify the universal set $U$ is based on leverage scores. We empirically show the benefits of our scheme for vision transformers, showing how to train new models that use our universal set while training as well, showing that our model is able to consistently select ``important keys'' during training.
Online Convex Optimization with a Separation Oracle
In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.
On Uncertainty In Natural Language Processing
The last decade in deep learning has brought on increasingly capable systems that are deployed on a wide variety of applications. In natural language processing, the field has been transformed by a number of breakthroughs including large language models, which are used in increasingly many user-facing applications. In order to reap the benefits of this technology and reduce potential harms, it is important to quantify the reliability of model predictions and the uncertainties that shroud their development. This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective, and how it can be reduced and quantified through the design of the experimental pipeline. We further explore uncertainty quantification in modeling by theoretically and empirically investigating the effect of inductive model biases in text classification tasks. The corresponding experiments include data for three different languages (Danish, English and Finnish) and tasks as well as a large set of different uncertainty quantification approaches. Additionally, we propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction, which provides tighter token sets with better coverage of the actual continuation. Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors, where the confidence is predicted from the input to and generated output text of the target model alone.
A Disentangled Recognition and Nonlinear Dynamics Model for Unsupervised Learning
Marco Fraccaro, Simon Kamronn, Ulrich Paquet, Ole Winther
This paper takes a step towards temporal reasoning in a dynamically changing video, not in the pixel space that constitutes its frames, but in a latent space that describes the non-linear dynamics of the objects in its world. We introduce the Kalman variational auto-encoder, a framework for unsupervised learning of sequential data that disentangles two latent representations: an object's representation, coming from a recognition model, and a latent state describing its dynamics. As a result, the evolution of the world can be imagined and missing data imputed, both without the need to generate high dimensional frames at each time step. The model is trained end-to-end on videos of a variety of simulated physical systems, and outperforms competing methods in generative and missing data imputation tasks.
Robust Hypothesis Test for Nonlinear Effect with Gaussian Processes
Utilizing the theory of reproducing kernels, we reduce this hypothesis to a simple one-sided score test for a scalar parameter, develop a testing procedure that is robust against the misspecification of kernel functions, and also propose an ensemble-based estimator for the null model to guarantee test performance in small samples. To demonstrate the utility of the proposed method, we apply our test to the problem of detecting nonlinear interaction between groups of continuous features. We evaluate the finite-sample performance of our test under different data-generating functions and estimation strategies for the null model. Our results reveal interesting connections between notions in machine learning (model underfit/overfit) and those in statistical inference (i.e. Type I error/power of hypothesis test), and also highlight unexpected consequences of common model estimating strategies (e.g.