Gradient Descent
Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
Wei, Rongzhe, Li, Mufei, Ghassemi, Mohsen, Kreaฤiฤ, Eleonora, Li, Yifan, Yue, Xiang, Li, Bo, Potluru, Vamsi K., Li, Pan, Chien, Eli
Large Language Models (LLMs) are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks (MIAs) to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective. Prior work shows that data outliers may exhibit higher memorization effects. Intuitively, they are harder to be unlearn and thus the privacy risk of unlearning them is overlooked and underestimated in the current evaluation. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information (PII) as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable and thorough assessments of LLM unlearning efficacy.
RealOSR: Latent Unfolding Boosting Diffusion-based Real-world Omnidirectional Image Super-Resolution
Sheng, Xuhan, Li, Runyi, Chen, Bin, Li, Weiqi, Jiang, Xu, Zhang, Jian
Omnidirectional image super-resolution (ODISR) aims to upscale low-resolution (LR) omnidirectional images (ODIs) to high-resolution (HR), addressing the growing demand for detailed visual content across a $180^{\circ}\times360^{\circ}$ viewport. Existing methods are limited by simple degradation assumptions (e.g., bicubic downsampling), which fail to capture the complex, unknown real-world degradation processes. Recent diffusion-based approaches suffer from slow inference due to their hundreds of sampling steps and frequent pixel-latent space conversions. To tackle these challenges, in this paper, we propose RealOSR, a novel diffusion-based approach for real-world ODISR (Real-ODISR) with single-step diffusion denoising. To sufficiently exploit the input information, RealOSR introduces a lightweight domain alignment module, which facilitates the efficient injection of LR ODI into the single-step latent denoising. Additionally, to better utilize the rich semantic and multi-scale feature modeling ability of denoising UNet, we develop a latent unfolding module that simulates the gradient descent process directly in latent space. Experimental results demonstrate that RealOSR outperforms previous methods in both ODI recovery quality and efficiency. Compared to the recent state-of-the-art diffusion-based ODISR method, OmniSSR, RealOSR achieves significant improvements in visual quality and over \textbf{200$\times$} inference acceleration. Our code and models will be released.
Sampling-based Continuous Optimization with Coupled Variables for RNA Design
Tang, Wei Yu, Dai, Ning, Zhou, Tianshuo, Mathews, David H., Huang, Liang
The task of RNA design given a target structure aims to find a sequence that can fold into that structure. It is a computationally hard problem where some version(s) have been proven to be NP-hard. As a result, heuristic methods such as local search have been popular for this task, but by only exploring a fixed number of candidates. They can not keep up with the exponential growth of the design space, and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences, and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make it universally applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Compared to the state-of-the-art methods, our work consistently outperforms them in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: http://github.com/weiyutang1010/ncrna_design.
Understanding Gradient Descent through the Training Jacobian
We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds. Our code is available at https://github.com/EleutherAI/training-jacobian
Distributed Gradient Descent with Many Local Steps in Overparameterized Models
Zhu, Heng, Vardhan, Harsh, Mazumdar, Arya
In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or the Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. Although the existing convergence analysis suggests that with heterogeneous data, FedAvg encounters quick performance degradation as the number of local steps increases, it is shown to work quite well in practice, especially in the distributed training of large language models. In this work we try to explain this good performance from a viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In overparameterized regime, the gradient descent at each compute node would lead the model to a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show it converges to the centralized model in direction for linear classification. We empirically verified our theoretical findings in linear models and also conducted experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.
Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks
Zhang, Junhe, Ni, Wanli, Wang, Dongyu
As a paradigm of distributed machine learning, federated learning typically requires all edge devices to train a complete model locally. However, with the increasing scale of artificial intelligence models, the limited resources on edge devices often become a bottleneck for efficient fine-tuning. To address this challenge, federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting. In this paper, we propose a lightweight FedSL scheme, that further alleviates the training burden on resource-constrained edge devices by pruning the client-side model dynamicly and using quantized gradient updates to reduce computation overhead. Additionally, we apply random dropout to the activation values at the split layer to reduce communication overhead. We conduct theoretical analysis to quantify the convergence performance of the proposed scheme. Finally, simulation results verify the effectiveness and advantages of the proposed lightweight FedSL in wireless network environments.
Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime
Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge the traditional idea by empirically demonstrating Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smooth constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding linear interpolator $\beta \in R^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits parameterization as $\beta = w^2_{+} - w^2_{-}$. Contrary to the previous work that suggests a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis then sheds some new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.
Ornstein-Uhlenbeck Adaptation as a Mechanism for Learning in Brains and Machines
Fernandez, Jesus Garcia, Ahmad, Nasir, van Gerven, Marcel
One of the main properties of any intelligent system is that it has the capacity to learn. This holds for biological systems, ranging from bacteria and fungi to plants and animals [17, 21, 42, 50], as well as for engineered systems designed by artificial intelligence (AI) researchers [59, 29, 9]. Modern intelligent systems, such as those used in machine learning, typically rely on gradient descent for learning by minimizing error gradients [32, 65, 49]. While gradient-based methods have driven significant advances in AI [29], their reliance on exact gradients, centralized updates, and complex information pathways limits their applicability in biological and neuromorphic systems In contrast, biological learning likely relies on different mechanisms, as organisms often lack the exact gradient information and centralized control that gradient descent requires [31, 68]. Neuromorphic computing, inspired by these principles, aims to replicate the distributed, energyefficient learning of biological systems [38, 40]. However, integrating traditional gradient-based methods into neuromorphic hardware has proven challenging, highlighting a critical gap: the need for gradient-free learning mechanisms that exclusively rely on operations that are local in space and time [24, 12]. To address this, alternative learning principles to gradient descent have been proposed for both rate-based [45, 7, 51, 5, 67] and spike-based models [36, 4, 22, 46]. A class of methods that leverages inherent noise present in biological systems to facilitate learning is perturbation-based methods [57, 69, 66], which adjust the system's parameters based on noise effects and global reinforcement signals, offering gradient-free, local learning suitable for biological or neuromorphic
Streaming Private Continual Counting via Binning
Andersson, Joel Daniel, Pagh, Rasmus
In differential privacy, $\textit{continual observation}$ refers to problems in which we wish to continuously release a function of a dataset that is revealed one element at a time. The challenge is to maintain a good approximation while keeping the combined output over all time steps differentially private. In the special case of $\textit{continual counting}$ we seek to approximate a sum of binary input elements. This problem has received considerable attention lately, in part due to its relevance in implementations of differentially private stochastic gradient descent. $\textit{Factorization mechanisms}$ are the leading approach to continual counting, but the best such mechanisms do not work well in $\textit{streaming}$ settings since they require space proportional to the size of the input. In this paper, we present a simple approach to approximating factorization mechanisms in low space via $\textit{binning}$, where adjacent matrix entries with similar values are changed to be identical in such a way that a matrix-vector product can be maintained in sublinear space. Our approach has provable sublinear space guarantees for a class of lower triangular matrices whose entries are monotonically decreasing away from the diagonal. We show empirically that even with very low space usage we are able to closely match, and sometimes surpass, the performance of asymptotically optimal factorization mechanisms. Recently, and independently of our work, Dvijotham et al. have also suggested an approach to implementing factorization mechanisms in a streaming setting. Their work differs from ours in several respects: It only addresses factorization into $\textit{Toeplitz}$ matrices, only considers $\textit{maximum}$ error, and uses a different technique based on rational function approximation that seems less versatile than our binning approach.
Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression
We study nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) in this paper. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $\cO(\eps_n^2)$, which is the same rate as that for the classical kernel regression trained by GD with early stopping, where $\eps_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. It is remarked that our result does not require distributional assumptions about the covariate as long as the covariate is bounded, in a strong contrast with many existing results which rely on specific distributions of the covariates such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. The rate $\cO(\eps_n^2)$ is known to be minimax optimal for specific cases, such as the case that the NTK has a polynomial eigenvalue decay rate which happens under certain distributional assumptions on the covariates. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions on the bounded covariate. We also provide confirmative answers to certain open questions or address particular concerns in the literature of training over-parameterized neural networks by GD with early stopping for nonparametric regression, including the characterization of the stopping time, the lower bound for the network width, and the constant learning rate used in GD.