Goto

Collaborating Authors

 poly log





On the Theory of Continual Learning with Gradient Descent for Neural Networks

arXiv.org Machine Learning

Gradient-based methods are the primary approach for training ne ural networks. In recent years, research in learning theory has shown that neural networks can efficiently lea rn various data classes using empirical risk minimization (ERM) methods. In many real-world settings, data a rrive sequentially in a non-stationary manner, requiring the learner to maintain performance on past tas ks while acquiring new capabilities. In such cases, a learning model must be continually learnable, meaning it should retain previously acquired knowledge when trained on new tasks. On the other hand, various le arning systems, including deep learning architectures, can be prone to catastrophic forgetting, that is, updating a model on new data causes a dramatic drop in performance on previously learned tasks [ McCloskey and Cohen, 1989, Goodfellow et al., 2013 ]. The goal of continual (lifelong) learning is to develop models and methods that, even without retraining on old data, experience minimal forgetting when incorporating new inform ation. Despite deep learning's ubiquity, characterizing the power and limitat ions of neural networks is still an ongoing research direction. While several recent works aim to unde rstand the power of gradient descent (GD) for training neural networks with stylized data distributions, these works are still limited to single-task scenarios (for some examples see [ Du et al., 2019, Bartlett et al., 2021, Abbe et al., 2022 ]).


Kernel ridge regression under power-law data: spectrum and generalization

arXiv.org Machine Learning

In this work, we investigate high-dimensional kernel ridge regression (KRR) on i.i.d. Gaussian data with anisotropic power-law covariance. This setting differs fundamentally from the classical source & capacity conditions for KRR, where power-law assumptions are typically imposed on the kernel eigen-spectrum itself. Our contributions are twofold. First, we derive an explicit characterization of the kernel spectrum for polynomial inner-product kernels, giving a precise description of how the kernel eigen-spectrum inherits the data decay. Second, we provide an asymptotic analysis of the excess risk in the high-dimensional regime for a particular kernel with this spectral behavior, showing that the sample complexity is governed by the effective dimension of the data rather than the ambient dimension. These results establish a fundamental advantage of learning with power-law anisotropic data over isotropic data. To our knowledge, this is the first rigorous treatment of non-linear KRR under power-law data.



Differentiable Quantum Computing for Large-scale Linear Control

arXiv.org Artificial Intelligence

As industrial models and designs grow increasingly complex, the demand for optimal control of large-scale dynamical systems has significantly increased. However, traditional methods for optimal control incur significant overhead as problem dimensions grow. In this paper, we introduce an end-to-end quantum algorithm for linear-quadratic control with provable speedups. Our algorithm, based on a policy gradient method, incorporates a novel quantum subroutine for solving the matrix Lyapunov equation. Specifically, we build a quantum-assisted differentiable simulator for efficient gradient estimation that is more accurate and robust than classical methods relying on stochastic approximation. Compared to the classical approaches, our method achieves a super-quadratic speedup. To the best of our knowledge, this is the first end-to-end quantum application to linear control problems with provable quantum advantage.


Des-q: a quantum algorithm to construct and efficiently retrain decision trees for regression and binary classification

arXiv.org Artificial Intelligence

Decision trees are widely used in machine learning due to their simplicity in construction and interpretability. However, as data sizes grow, traditional methods for constructing and retraining decision trees become increasingly slow, scaling polynomially with the number of training examples. In this work, we introduce a novel quantum algorithm, named Des-q, for constructing and retraining decision trees in regression and binary classification tasks. Assuming the data stream produces small increments of new training examples, we demonstrate that our Des-q algorithm significantly reduces the time required for tree retraining, achieving a poly-logarithmic time complexity in the number of training examples, even accounting for the time needed to load the new examples into quantum-accessible memory. Our approach involves building a decision tree algorithm to perform k-piecewise linear tree splits at each internal node. These splits simultaneously generate multiple hyperplanes, dividing the feature space into k distinct regions. To determine the k suitable anchor points for these splits, we develop an efficient quantum-supervised clustering method, building upon the q-means algorithm of Kerenidis et al. Des-q first efficiently estimates each feature weight using a novel quantum technique to estimate the Pearson correlation. Subsequently, we employ weighted distance estimation to cluster the training examples in k disjoint regions and then proceed to expand the tree using the same procedure. We benchmark the performance of the simulated version of our algorithm against the state-of-the-art classical decision tree for regression and binary classification on multiple data sets with numerical features. Further, we showcase that the proposed algorithm exhibits similar performance to the state-of-the-art decision tree while significantly speeding up the periodic tree retraining.


Consistent Risk Estimation in High-Dimensional Linear Regression

arXiv.org Machine Learning

Risk estimation is at the core of many learning systems. The importance of this problem has motivated researchers to propose different schemes, such as cross validation, generalized cross validation, and Bootstrap. The theoretical properties of such estimates have been extensively studied in the low-dimensional settings, where the number of predictors $p$ is much smaller than the number of observations $n$. However, a unifying methodology accompanied with a rigorous theory is lacking in high-dimensional settings. This paper studies the problem of risk estimation under the high-dimensional asymptotic setting $n,p \rightarrow \infty$ and $n/p \rightarrow \delta$ ($\delta$ is a fixed number), and proves the consistency of three risk estimates that have been successful in numerical studies, i.e., leave-one-out cross validation (LOOCV), approximate leave-one-out (ALO), and approximate message passing (AMP)-based techniques. A corner stone of our analysis is a bound that we obtain on the discrepancy of the `residuals' obtained from AMP and LOOCV. This connection not only enables us to obtain a more refined information on the estimates of AMP, ALO, and LOOCV, but also offers an upper bound on the convergence rate of each estimate.