arXiv.org Machine Learning
An Overview of Low-Rank Structures in the Training and Adaptation of Large Models
Balzano, Laura, Ding, Tianjiao, Haeffele, Benjamin D., Kwon, Soo Min, Qu, Qing, Wang, Peng, Wang, Zhangyang, Yaras, Can
The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon--such as low-rank adaptation (LoRA) and low-rank training--enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the optimization dynamics of gradient descent, and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning. In summary, this tutorial provides researchers and practitioners with a deeper understanding of low-rank structures in the training and adaptation of large-scale deep learning models, highlighting both the theoretical foundations and practical applications of low-rank methods, and outlining promising directions for future research.
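To make the low-rank adaptation idea concrete, here is a minimal PyTorch sketch of a LoRA-style layer, where a frozen pretrained weight $W_0$ is augmented by a trainable low-rank product $BA$; the rank, scaling factor, and layer sizes below are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W0 x + (alpha/r) B A x."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(out_features, r))        # B starts at zero,
        self.scale = alpha / r                                     # so training begins at W0

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768)
out = layer(torch.randn(4, 768))   # only A and B receive gradients
```

With rank $r$ much smaller than the layer width, the trainable parameter count drops from $O(d^2)$ to $O(dr)$, which is the efficiency gain the abstract refers to.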
Extensions of regret-minimization algorithm for optimal design
We explore extensions and applications of the regret minimization framework introduced by~\cite{design} for solving optimal experimental design problems. Specifically, we incorporate the entropy regularizer into this framework, leading to a novel sample selection objective and a provable sample complexity bound that guarantees a $(1+\epsilon)$-near optimal solution. We further extend the method to handle regularized optimal design settings. As an application, we use our algorithm to select a small set of representative samples from image classification datasets without relying on label information. To evaluate the quality of the selected samples, we train a logistic regression model and compare performance against several baseline sampling strategies. Experimental results on MNIST, CIFAR-10, and a 50-class subset of ImageNet show that our approach consistently outperforms competing methods in most cases.
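The paper's regret-minimization and entropy-regularized selection objectives are not reproduced here; as a hedged point of reference, the sketch below implements only the classical greedy D-optimal criterion (maximize the log-determinant of the information matrix) that such optimal-design methods build upon and improve.

```python
import numpy as np

def greedy_d_optimal(X, k, ridge=1e-3):
    """Greedily pick k rows of X maximizing log det(sum_i x_i x_i^T + ridge * I).

    Classical D-optimal design baseline only; the paper's regret-minimization
    framework and entropy-regularized objective are not shown here.
    """
    n, d = X.shape
    M = ridge * np.eye(d)                 # regularized information matrix
    chosen = []
    for _ in range(k):
        _, base = np.linalg.slogdet(M)
        gains = np.full(n, -np.inf)
        for i in range(n):
            if i not in chosen:
                _, ld = np.linalg.slogdet(M + np.outer(X[i], X[i]))
                gains[i] = ld - base      # marginal log-det gain of sample i
        best = int(np.argmax(gains))
        chosen.append(best)
        M += np.outer(X[best], X[best])
    return chosen

X = np.random.randn(200, 10)
print(greedy_d_optimal(X, k=15))
```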
On the Robustness of Kernel Ridge Regression Using the Cauchy Loss Function
Wen, Hongwei, Betken, Annika, Koolen, Wouter
Robust regression aims to develop methods for estimating an unknown regression function in the presence of outliers, heavy-tailed distributions, or contaminated data, which can severely impact performance. Most existing theoretical results in robust regression assume that the noise has a finite absolute mean, an assumption violated by certain distributions, such as Cauchy and some Pareto noise. In this paper, we introduce a generalized Cauchy noise framework that accommodates all noise distributions possessing a finite moment of some positive order, even when the absolute mean is infinite. Within this framework, we study the \textit{kernel Cauchy ridge regressor} (\textit{KCRR}), which minimizes a regularized empirical Cauchy risk to achieve robustness. To derive the $L_2$-risk bound for KCRR, we establish a connection between the excess Cauchy risk and $L_2$-risk for sufficiently large scale parameters of the Cauchy loss, which reveals that these two risks are equivalent. Furthermore, under the assumption that the regression function satisfies H\"older smoothness, we derive excess Cauchy risk bounds for KCRR, showing improved performance as the scale parameter decreases. By considering the twofold effect of the scale parameter on the excess Cauchy risk and its equivalence with the $L_2$-risk, we establish the almost minimax-optimal convergence rate for KCRR in terms of $L_2$-risk, highlighting the robustness of the Cauchy loss in handling various types of noise. Finally, we validate the effectiveness of KCRR through experiments on both synthetic and real-world datasets under diverse noise corruption scenarios.
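As a rough illustration of the estimator being analyzed, the sketch below minimizes a regularized empirical Cauchy risk with the loss $\ell_\sigma(r) = (\sigma^2/2)\log(1 + r^2/\sigma^2)$ over a Gaussian-kernel function class by plain gradient descent; the kernel, step size, and regularization strength are illustrative assumptions, and the paper's theoretical tuning of the scale parameter is not reflected.

```python
import numpy as np

def gauss_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kcrr_fit(X, y, sigma=1.0, lam=1e-2, lr=0.1, iters=500):
    """Sketch of a kernel Cauchy ridge regressor: minimize
    (1/n) sum_i (sigma^2/2) log(1 + (y_i - (K a)_i)^2 / sigma^2) + lam * a^T K a
    by gradient descent on coefficients a, with f = K a."""
    n = len(y)
    K = gauss_kernel(X, X)
    a = np.zeros(n)
    for _ in range(iters):
        r = y - K @ a
        # derivative of the Cauchy loss in f is -r / (1 + r^2 / sigma^2)
        grad = -K.T @ (r / (1.0 + r ** 2 / sigma ** 2)) / n + 2 * lam * (K @ a)
        a -= lr * grad
    return a, K

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_cauchy(100)   # heavy-tailed noise
a, K = kcrr_fit(X, y)
```

Note how the residual enters the gradient as $r / (1 + r^2/\sigma^2)$, which is bounded; this damping of large residuals is the mechanism behind the robustness the abstract describes.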
Deep learning-based identification of precipitation clouds from all-sky camera data for observatory safety
Haghighi, Mohammad H. Zhoolideh, Ghasrimanesh, Alireza, Khosroshahi, Habib
Wide-angle all-sky cameras are used in most astronomical observatories to monitor night-sky conditions, in particular cloudiness. In this manuscript, we apply a deep-learning approach to automate the identification of precipitation clouds in all-sky camera data as a cloud warning system. We construct our original training and test sets from the all-sky camera image archive of the Iranian National Observatory (INO). The training and test images are labeled manually based on their potential for rainfall and their distribution in the sky. We train our model on a set of roughly 2445 images taken by the INO all-sky camera, using a deep learning method based on the EfficientNet architecture. Our model reaches an average accuracy of 99\% in determining clouds' rainfall potential and an accuracy of 96\% for cloud coverage. To enable a comprehensive comparison and evaluate the performance of alternative architectures for the task, we additionally trained three models: LeNet, DeiT, and AlexNet. This approach can provide early warning of dangerous clouds approaching telescopes, harnessing the power of deep learning to automatically analyze vast amounts of all-sky camera data and accurately identify precipitation cloud formations. Our trained model can be deployed for real-time analysis, enabling rapid identification of potential threats and offering a scalable solution that can improve our ability to safeguard telescopes and instruments in observatories. This is increasingly important as numerous small and medium-sized telescopes are integrated with smart control systems to reduce manual operation.
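A minimal sketch of the kind of EfficientNet-based classifier described, using torchvision; the two-class head, image size, and hyperparameters are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical two-class head: 0 = safe sky, 1 = precipitation cloud.
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) float tensor; labels: (B,) long tensor."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Starting from ImageNet-pretrained weights and replacing only the final linear layer is a standard fine-tuning recipe for small labeled sets of a few thousand images, such as the one described here.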
Finite-Time Bounds for Two-Time-Scale Stochastic Approximation with Arbitrary Norm Contractions and Markovian Noise
Chandak, Siddharth, Haque, Shaan Ul, Bambos, Nicholas
Two-time-scale Stochastic Approximation (SA) is an iterative algorithm with applications in reinforcement learning and optimization. Prior finite-time analyses of such algorithms have focused on fixed-point iterations with mappings that are contractive under the Euclidean norm. Motivated by applications in reinforcement learning, we give the first mean-square bound for nonlinear two-time-scale SA in which the iterations have arbitrary-norm contractive mappings and Markovian noise. We show that the mean-square error decays at a rate of $O(1/n^{2/3})$ in the general case, and at a rate of $O(1/n)$ in a special case where the slower timescale is noiseless. Our analysis uses the generalized Moreau envelope to handle the arbitrary-norm contractions and solutions of the Poisson equation to deal with the Markovian noise. By analyzing the SSP Q-Learning algorithm, we give the first $O(1/n)$ bound for an algorithm for asynchronous control of MDPs under the average-reward criterion. We also obtain a rate of $O(1/n)$ for Q-Learning with Polyak averaging and provide an algorithm for learning the Generalized Nash Equilibrium (GNE) of strongly monotone games which converges at a rate of $O(1/n^{2/3})$.
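To fix ideas, here is a toy two-time-scale SA iteration with scalar iterates and additive noise; the drift maps, stepsizes, and noise level are illustrative choices, not the paper's setting (in particular, the noise here is i.i.d. rather than Markovian).

```python
import numpy as np

# Fast iterate:  x_{n+1} = x_n + a_n * (f(x_n, y_n) + noise)
# Slow iterate:  y_{n+1} = y_n + b_n * (g(x_n, y_n) + noise),  with b_n = o(a_n).
rng = np.random.default_rng(0)
f = lambda x, y: -0.5 * x + y      # fast drift: fixed point x*(y) = 2y
g = lambda x, y: 0.25 * x - y      # slow drift: along x = 2y, y drifts to 0

x, y = 1.0, 1.0
errs = []
for n in range(1, 100_000):
    a_n = 1.0 / n ** (2 / 3)       # faster timescale: larger stepsize
    b_n = 1.0 / n                  # slower timescale
    x += a_n * (f(x, y) + 0.1 * rng.standard_normal())
    y += b_n * (g(x, y) + 0.1 * rng.standard_normal())
    errs.append(x * x + y * y)     # squared distance to the fixed point (0, 0)

print("final mean-square error estimate:", np.mean(errs[-1000:]))
```

The separation $b_n = o(a_n)$ lets the fast iterate track its fixed point $x^*(y)$ while the slow iterate sees a nearly equilibrated fast variable, which is the structure the finite-time bounds in the abstract quantify.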
Learning a Class of Mixed Linear Regressions: Global Convergence under General Data Conditions
Liu, Yujing, Liu, Zhixin, Guo, Lei
Mixed linear regression (MLR) has attracted increasing attention because of its great theoretical and practical importance in capturing nonlinear relationships by utilizing a mixture of linear regression sub-models. Although considerable efforts have been devoted to the learning problem of such systems, i.e., estimating data labels and identifying model parameters, most existing investigations employ offline algorithms, impose strict independent and identically distributed (i.i.d.) or persistent excitation (PE) conditions on the regressor data, and provide only local convergence results. In this paper, we investigate the recursive estimation and data clustering problems for a class of stochastic MLRs with two components. To address this inherently nonconvex optimization problem, we propose a novel two-step recursive identification algorithm to estimate the true parameters, where the direction vector and the scaling coefficient of the unknown parameters are estimated by the least squares and the expectation-maximization (EM) principles, respectively. Under a general data condition, which is much weaker than the traditional i.i.d. and PE conditions, we establish the global convergence and the convergence rate of the proposed identification algorithm for the first time. Furthermore, we prove that, without any excitation condition on the regressor data, the data clustering performance, including the cumulative mis-classification error and the within-cluster error, can be asymptotically optimal. Finally, we provide a numerical example to illustrate the performance of the proposed learning algorithm.
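For orientation, the sketch below runs a classical online EM baseline for a two-component MLR: soft responsibilities weight an SGD update per component. This is explicitly not the paper's two-step algorithm (least-squares estimation of the direction vector plus EM estimation of the scaling coefficient); the step size and noise variance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def online_em_mlr(stream, d, step=0.05, noise_var=0.25):
    """Illustrative online EM for y = w_k^T x + noise, k in {0, 1}."""
    W = 0.1 * rng.standard_normal((2, d))
    for x, y in stream:
        res = y - W @ x                        # residual under each component
        ll = -0.5 * res ** 2 / noise_var       # per-component log-likelihood
        resp = np.exp(ll - ll.max())
        resp /= resp.sum()                     # E-step: soft responsibilities
        W += step * (resp * res)[:, None] * x  # M-step: responsibility-weighted SGD
    return W

w_true = np.array([[1.0, -2.0], [-1.0, 2.0]])
def stream(n=20_000):
    for _ in range(n):
        x = rng.standard_normal(2)
        k = rng.integers(2)
        yield x, w_true[k] @ x + 0.5 * rng.standard_normal()

print(online_em_mlr(stream(), d=2))   # rows approach w_true up to label swap
```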
Differentially Private Joint Independence Test
Liu, Xingwei, Chen, Yuexin, Xu, Wangli
Identification of joint dependence among more than two random vectors plays an important role in many statistical applications, where the data may contain sensitive or confidential information. In this paper, we consider the $d$-variable Hilbert-Schmidt independence criterion (dHSIC) in the context of differential privacy. Since the limiting distribution of the empirical estimate of dHSIC is a complicated Gaussian chaos, tests in the non-private regime are typically constructed via permutation or the bootstrap. To detect joint dependence under privacy constraints, we propose a dHSIC-based testing procedure that employs a differentially private permutation methodology. Our method enjoys a privacy guarantee, a valid level, and pointwise consistency, while its bootstrap counterpart suffers from inconsistent power. We further investigate the uniform power of the proposed test in the dHSIC metric and the $L_2$ metric, showing that the proposed test attains the minimax-optimal power across different privacy regimes. As a byproduct, our results also cover the pointwise and uniform power of the non-private permutation dHSIC test, addressing an open question left unresolved in Pfister et al. (2018).
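For readers unfamiliar with the non-private baseline, here is a minimal permutation HSIC test for two variables; the paper's differentially private permutation step and the general $d$-variable dHSIC statistic are not reproduced, and the kernel bandwidth is an illustrative choice.

```python
import numpy as np

def gram(v, gamma=1.0):
    """Gaussian Gram matrix of a 1-D sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-gamma * d2)

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / n ** 2    # biased HSIC estimate

def perm_test(x, y, n_perm=500, seed=0):
    """Non-private permutation HSIC test for d = 2 variables."""
    rng = np.random.default_rng(seed)
    K, L = gram(x), gram(y)
    stat = hsic(K, L)
    count = 0
    for _ in range(n_perm):
        p = rng.permutation(len(y))
        count += hsic(K, L[np.ix_(p, p)]) >= stat
    return (1 + count) / (1 + n_perm)          # permutation p-value

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = x ** 2 + 0.1 * rng.standard_normal(100)    # nonlinearly dependent
print("p-value:", perm_test(x, y))
```

Permuting one sample breaks any dependence while preserving marginals, which is why the permutation null is valid without knowing the Gaussian-chaos limiting law; the private version adds a calibrated noise mechanism on top of this scheme.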
Dynamically Learning to Integrate in Recurrent Neural Networks
Bordelon, Blake, Cotler, Jordan, Pehlevan, Cengiz, Zavatone-Veth, Jacob A.
Learning to remember over long timescales is fundamentally challenging for recurrent neural networks (RNNs). While much prior work has explored why RNNs struggle to learn long timescales and how to mitigate this, we still lack a clear understanding of the dynamics involved when RNNs learn long timescales via gradient descent. Here we build a mathematical theory of the learning dynamics of linear RNNs trained to integrate white noise. We show that when the initial recurrent weights are small, the dynamics of learning are described by a low-dimensional system that tracks a single outlier eigenvalue of the recurrent weights. This reveals the precise manner in which the long timescale associated with white noise integration is learned. We extend our analyses to RNNs learning a damped oscillatory filter, and find rich dynamical equations for the evolution of a conjugate pair of outlier eigenvalues. Taken together, our analyses build a rich mathematical framework for studying dynamical learning problems salient for both machine learning and neuroscience.
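The setting in this abstract can be reproduced in a few lines: a linear RNN with small initial weights, trained by backpropagation through time to integrate white noise, while tracking the top eigenvalue of the recurrent matrix. Network size, horizon, and learning rate below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Linear RNN h_{t+1} = W h_t + b x_t, prediction yhat_t = c . h_{t+1}, trained to
# integrate white noise (target z_t = x_0 + ... + x_t). From a small random init,
# the theory predicts an outlier eigenvalue of W emerging and approaching 1.
rng = np.random.default_rng(0)
N, T, lr = 32, 50, 1e-2
W = 0.01 * rng.standard_normal((N, N)) / np.sqrt(N)
b = rng.standard_normal(N) / np.sqrt(N)
c = rng.standard_normal(N) / np.sqrt(N)

for step in range(3001):
    x = rng.standard_normal(T)
    z = np.cumsum(x)                              # integration target
    hs = np.zeros((T + 1, N))                     # forward pass
    for t in range(T):
        hs[t + 1] = W @ hs[t] + b * x[t]
    err = hs[1:] @ c - z                          # per-step prediction error
    gW, gb, g = np.zeros_like(W), np.zeros_like(b), np.zeros(N)
    for t in reversed(range(T)):                  # backward pass (BPTT)
        g = err[t] * c + W.T @ g                  # dL/dh_{t+1}
        gW += np.outer(g, hs[t])
        gb += g * x[t]
    gc = err @ hs[1:]                             # dL/dc
    W -= lr * gW / T; b -= lr * gb / T; c -= lr * gc / T
    if step % 500 == 0:
        top = np.abs(np.linalg.eigvals(W)).max()
        print(f"step {step}: top |eig(W)| = {top:.3f}, mse = {np.mean(err**2):.3f}")
```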
Local Interference: Removing Interference Bias in Semi-Parametric Causal Models
O'Riordan, Michael, Gilligan-Lee, Ciarán M.
Interference bias is a major impediment to identifying causal effects in real-world settings. For example, vaccination reduces the transmission of a virus in a population such that everyone benefits -- even those who are not treated. This is a source of bias that must be accounted for if one wants to learn the true effect of a vaccine on an individual's immune system. Previous approaches addressing interference bias require strong domain knowledge in the form of a graphical interaction network fully describing interference between units. Moreover, they place additional constraints on the form the interference can take, such as restricting to linear outcome models, and assuming that interference experienced by a unit does not depend on the unit's covariates. Our work addresses these shortcomings. We first provide and justify a novel definition of causal models with local interference. We prove that the True Average Causal Effect, a measure of causality where interference has been removed, can be identified in certain semi-parametric models satisfying this definition. These models allow for non-linearity, and also for interference to depend on a unit's covariates. An analytic estimand for the True Average Causal Effect is given in such settings. We further prove that the True Average Causal Effect cannot be identified in arbitrary models with local interference, showing that identification requires semi-parametric assumptions. Finally, we provide an empirical validation of our method on both simulated and real-world datasets.
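A toy simulation, with arbitrary coefficients, of why interference must be accounted for: under spillover, the within-experiment treated-versus-control contrast recovers only the direct effect, while the effect of treating everyone versus no one also includes the spillover. The paper's True Average Causal Effect and its analytic estimand are not implemented here.

```python
import numpy as np

# Outcomes gain a direct effect of 2.0 from a unit's own treatment plus a
# spillover of 3.0 times the overall treated fraction, echoing the vaccination
# example where everyone benefits from coverage.
rng = np.random.default_rng(0)
n = 10_000

def outcome(t):
    return 2.0 * t + 3.0 * t.mean() + rng.standard_normal(len(t))

t = rng.integers(0, 2, n)                       # 50/50 randomized assignment
y = outcome(t)
naive = y[t == 1].mean() - y[t == 0].mean()     # within-experiment contrast, ~2.0

policy = outcome(np.ones(n)).mean() - outcome(np.zeros(n)).mean()  # ~5.0

print(f"treated-vs-control contrast: {naive:.2f}")
print(f"everyone-vs-no-one effect:   {policy:.2f}")
```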
Structured and sparse partial least squares coherence for multivariate cortico-muscular analysis
Sun, Jingyao, Zhang, Qilu, Ma, Di, Jia, Tianyu, Jia, Shijie, Zhai, Xiaoxue, Xie, Ruimou, Lin, Ping-Ju, Li, Zhibin, Pan, Yu, Ji, Linhong, Li, Chong
Multivariate cortico-muscular analysis has recently emerged as a promising approach for evaluating the corticospinal neural pathway. However, current multivariate approaches encounter challenges such as high dimensionality and limited sample sizes, restricting their further application. In this paper, we propose a structured and sparse partial least squares coherence algorithm (ssPLSC) to extract shared latent-space representations related to cortico-muscular interactions. Our approach leverages an embedded optimization framework that integrates a partial least squares (PLS)-based objective function, a sparsity constraint, and a connectivity-based structured constraint to address generalizability, interpretability, and spatial structure. To solve the optimization problem, we develop an efficient alternating iterative algorithm within a unified framework and verify its convergence experimentally. Extensive experimental results on one synthetic and several real-world datasets demonstrate that ssPLSC achieves performance competitive with or better than several representative multivariate cortico-muscular fusion methods, particularly in scenarios characterized by limited sample sizes and high noise levels. This study provides a novel multivariate fusion method for cortico-muscular analysis, offering a transformative tool for evaluating corticospinal pathway integrity in neurological disorders.
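As a generic building block behind methods of this kind, the sketch below computes a rank-1 sparse PLS component by alternating soft-thresholded power iterations; the paper's connectivity-based structured constraint and its frequency-domain coherence objective are deliberately omitted, and the sparsity levels are illustrative.

```python
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_pls_component(X, Y, lam_u=0.1, lam_v=0.1, iters=100):
    """Rank-1 sparse PLS sketch: alternately update weights u, v to maximize
    cov(Xu, Yv) with soft-thresholding for sparsity. ssPLSC additionally
    imposes a connectivity-based structured constraint, not shown here."""
    C = X.T @ Y                                  # cross-covariance matrix
    u = np.ones(X.shape[1]) / np.sqrt(X.shape[1])
    v = np.ones(Y.shape[1]) / np.sqrt(Y.shape[1])
    for _ in range(iters):
        u = soft_threshold(C @ v, lam_u)
        u /= np.linalg.norm(u) + 1e-12           # unit-norm constraint
        v = soft_threshold(C.T @ u, lam_v)
        v /= np.linalg.norm(v) + 1e-12
    return u, v

rng = np.random.default_rng(0)
Z = rng.standard_normal((500, 1))                # shared latent signal
X = Z @ rng.standard_normal((1, 20)) + 0.5 * rng.standard_normal((500, 20))
Y = Z @ rng.standard_normal((1, 12)) + 0.5 * rng.standard_normal((500, 12))
u, v = sparse_pls_component(X, Y)
```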