AITopics

Large Language Models (LLMs) are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks (MIAs) to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective. Prior work shows that data outliers may exhibit higher memorization effects. Intuitively, they are harder to be unlearn and thus the privacy risk of unlearning them is overlooked and underestimated in the current evaluation. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information (PII) as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable and thorough assessments of LLM unlearning efficacy.

large language model, machine learning, natural language, (16 more...)

2412.08559

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Energy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

RealOSR: Latent Unfolding Boosting Diffusion-based Real-world Omnidirectional Image Super-Resolution

Sheng, Xuhan, Li, Runyi, Chen, Bin, Li, Weiqi, Jiang, Xu, Zhang, Jian

Omnidirectional image super-resolution (ODISR) aims to upscale low-resolution (LR) omnidirectional images (ODIs) to high-resolution (HR), addressing the growing demand for detailed visual content across a $180^{\circ}\times360^{\circ}$ viewport. Existing methods are limited by simple degradation assumptions (e.g., bicubic downsampling), which fail to capture the complex, unknown real-world degradation processes. Recent diffusion-based approaches suffer from slow inference due to their hundreds of sampling steps and frequent pixel-latent space conversions. To tackle these challenges, in this paper, we propose RealOSR, a novel diffusion-based approach for real-world ODISR (Real-ODISR) with single-step diffusion denoising. To sufficiently exploit the input information, RealOSR introduces a lightweight domain alignment module, which facilitates the efficient injection of LR ODI into the single-step latent denoising. Additionally, to better utilize the rich semantic and multi-scale feature modeling ability of denoising UNet, we develop a latent unfolding module that simulates the gradient descent process directly in latent space. Experimental results demonstrate that RealOSR outperforms previous methods in both ODI recovery quality and efficiency. Compared to the recent state-of-the-art diffusion-based ODISR method, OmniSSR, RealOSR achieves significant improvements in visual quality and over \textbf{200$\times$} inference acceleration. Our code and models will be released.

artificial intelligence, machine learning, proceedings, (18 more...)

2412.09646

Country: Asia > China (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Sampling-based Continuous Optimization with Coupled Variables for RNA Design

Tang, Wei Yu, Dai, Ning, Zhou, Tianshuo, Mathews, David H., Huang, Liang

The task of RNA design given a target structure aims to find a sequence that can fold into that structure. It is a computationally hard problem where some version(s) have been proven to be NP-hard. As a result, heuristic methods such as local search have been popular for this task, but by only exploring a fixed number of candidates. They can not keep up with the exponential growth of the design space, and often perform poorly on longer and harder-to-design structures. We instead formulate these discrete problems as continuous optimization, which starts with a distribution over all possible candidate sequences, and uses gradient descent to improve the expectation of an objective function. We define novel distributions based on coupled variables to rule out invalid sequences given the target structure and to model the correlation between nucleotides. To make it universally applicable to any objective function, we use sampling to approximate the expected objective function, to estimate the gradient, and to select the final candidate. Compared to the state-of-the-art methods, our work consistently outperforms them in key metrics such as Boltzmann probability, ensemble defect, and energy gap, especially on long and hard-to-design puzzles in the Eterna100 benchmark. Our code is available at: http://github.com/weiyutang1010/ncrna_design.

artificial intelligence, machine learning, puzzle, (15 more...)

2412.08751

Country:

Europe > Austria > Vienna (0.04)
North America > United States > Oregon (0.04)
North America > United States > California (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Belrose, Nora, Scherlis, Adam

Understanding Gradient Descent through the Training Jacobian

We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a "chaotic" region of values orders of magnitude greater than one, a large "bulk" region of values extremely close to one, and a "stable" region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network's output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds. Our code is available at https://github.com/EleutherAI/training-jacobian

jacobian, singular vector, subspace, (16 more...)

2412.07003

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.51)

Zhu, Heng, Vardhan, Harsh, Mazumdar, Arya

Distributed Gradient Descent with Many Local Steps in Overparameterized Models

arXiv.org Machine LearningDec-10-2024

In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or the Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. Although the existing convergence analysis suggests that with heterogeneous data, FedAvg encounters quick performance degradation as the number of local steps increases, it is shown to work quite well in practice, especially in the distributed training of large language models. In this work we try to explain this good performance from a viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In overparameterized regime, the gradient descent at each compute node would lead the model to a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show it converges to the centralized model in direction for linear classification. We empirically verified our theoretical findings in linear models and also conducted experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.

artificial intelligence, local-gd, machine learning, (16 more...)

arXiv.org Machine Learning

2412.07971

Country:

North America > United States > Virginia (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Artificial IntelligenceDec-10-2024

Federated Split Learning with Model Pruning and Gradient Quantization in Wireless Networks

Zhang, Junhe, Ni, Wanli, Wang, Dongyu

As a paradigm of distributed machine learning, federated learning typically requires all edge devices to train a complete model locally. However, with the increasing scale of artificial intelligence models, the limited resources on edge devices often become a bottleneck for efficient fine-tuning. To address this challenge, federated split learning (FedSL) implements collaborative training across the edge devices and the server through model splitting. In this paper, we propose a lightweight FedSL scheme, that further alleviates the training burden on resource-constrained edge devices by pruning the client-side model dynamicly and using quantized gradient updates to reduce computation overhead. Additionally, we apply random dropout to the activation values at the split layer to reduce communication overhead. We conduct theoretical analysis to quantify the convergence performance of the proposed scheme. Finally, simulation results verify the effectiveness and advantages of the proposed lightweight FedSL in wireless network environments.

activation value, artificial intelligence, machine learning, (16 more...)

doi: 10.1109/TVT.2024.3515083

2412.06414

Country: Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Zhang, Peiyuan, Karbasi, Amin

Criteria and Bias of Parameterized Linear Regression under Edge of Stability Regime

arXiv.org Machine LearningDec-10-2024

Classical optimization theory requires a small step-size for gradient-based methods to converge. Nevertheless, recent findings challenge the traditional idea by empirically demonstrating Gradient Descent (GD) converges even when the step-size $\eta$ exceeds the threshold of $2/L$, where $L$ is the global smooth constant. This is usually known as the Edge of Stability (EoS) phenomenon. A widely held belief suggests that an objective function with subquadratic growth plays an important role in incurring EoS. In this paper, we provide a more comprehensive answer by considering the task of finding linear interpolator $\beta \in R^{d}$ for regression with loss function $l(\cdot)$, where $\beta$ admits parameterization as $\beta = w^2_{+} - w^2_{-}$. Contrary to the previous work that suggests a subquadratic $l$ is necessary for EoS, our novel finding reveals that EoS occurs even when $l$ is quadratic under proper conditions. This argument is made rigorous by both empirical and theoretical evidence, demonstrating the GD trajectory converges to a linear interpolator in a non-asymptotic way. Moreover, the model under quadratic $l$, also known as a depth-$2$ diagonal linear network, remains largely unexplored under the EoS regime. Our analysis then sheds some new light on the implicit bias of diagonal linear networks when a larger step-size is employed, enriching the understanding of EoS on more practical models.

convergence, iteration, regime, (13 more...)

arXiv.org Machine Learning

2412.08025

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.50)

Fernandez, Jesus Garcia, Ahmad, Nasir, van Gerven, Marcel

Ornstein-Uhlenbeck Adaptation as a Mechanism for Learning in Brains and Machines

arXiv.org Artificial IntelligenceDec-9-2024

One of the main properties of any intelligent system is that it has the capacity to learn. This holds for biological systems, ranging from bacteria and fungi to plants and animals [17, 21, 42, 50], as well as for engineered systems designed by artificial intelligence (AI) researchers [59, 29, 9]. Modern intelligent systems, such as those used in machine learning, typically rely on gradient descent for learning by minimizing error gradients [32, 65, 49]. While gradient-based methods have driven significant advances in AI [29], their reliance on exact gradients, centralized updates, and complex information pathways limits their applicability in biological and neuromorphic systems In contrast, biological learning likely relies on different mechanisms, as organisms often lack the exact gradient information and centralized control that gradient descent requires [31, 68]. Neuromorphic computing, inspired by these principles, aims to replicate the distributed, energyefficient learning of biological systems [38, 40]. However, integrating traditional gradient-based methods into neuromorphic hardware has proven challenging, highlighting a critical gap: the need for gradient-free learning mechanisms that exclusively rely on operations that are local in space and time [24, 12]. To address this, alternative learning principles to gradient descent have been proposed for both rate-based [45, 7, 51, 5, 67] and spike-based models [36, 4, 22, 46]. A class of methods that leverages inherent noise present in biological systems to facilitate learning is perturbation-based methods [57, 69, 66], which adjust the system's parameters based on noise effects and global reinforcement signals, offering gradient-free, local learning suitable for biological or neuromorphic

artificial intelligence, machine learning, noise, (14 more...)

2410.13563

Country: Europe > United Kingdom > England (0.28)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.94)
Energy > Oil & Gas (0.68)
Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Andersson, Joel Daniel, Pagh, Rasmus

Streaming Private Continual Counting via Binning

arXiv.org Artificial IntelligenceDec-9-2024

In differential privacy, $\textit{continual observation}$ refers to problems in which we wish to continuously release a function of a dataset that is revealed one element at a time. The challenge is to maintain a good approximation while keeping the combined output over all time steps differentially private. In the special case of $\textit{continual counting}$ we seek to approximate a sum of binary input elements. This problem has received considerable attention lately, in part due to its relevance in implementations of differentially private stochastic gradient descent. $\textit{Factorization mechanisms}$ are the leading approach to continual counting, but the best such mechanisms do not work well in $\textit{streaming}$ settings since they require space proportional to the size of the input. In this paper, we present a simple approach to approximating factorization mechanisms in low space via $\textit{binning}$, where adjacent matrix entries with similar values are changed to be identical in such a way that a matrix-vector product can be maintained in sublinear space. Our approach has provable sublinear space guarantees for a class of lower triangular matrices whose entries are monotonically decreasing away from the diagonal. We show empirically that even with very low space usage we are able to closely match, and sometimes surpass, the performance of asymptotically optimal factorization mechanisms. Recently, and independently of our work, Dvijotham et al. have also suggested an approach to implementing factorization mechanisms in a streaming setting. Their work differs from ours in several respects: It only addresses factorization into $\textit{Toeplitz}$ matrices, only considers $\textit{maximum}$ error, and uses a different technique based on rational function approximation that seems less versatile than our binning approach.

artificial intelligence, machine learning, matrix, (17 more...)

2412.07093

Country:

Europe > Denmark > Capital Region > Copenhagen (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Nevada (0.04)
(2 more...)

Genre:

Research Report (0.63)
Workflow (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Machine LearningDec-9-2024

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Yang, Yingzhen, Li, Ping

We study nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) in this paper. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $\cO(\eps_n^2)$, which is the same rate as that for the classical kernel regression trained by GD with early stopping, where $\eps_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. It is remarked that our result does not require distributional assumptions about the covariate as long as the covariate is bounded, in a strong contrast with many existing results which rely on specific distributions of the covariates such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. The rate $\cO(\eps_n^2)$ is known to be minimax optimal for specific cases, such as the case that the NTK has a polynomial eigenvalue decay rate which happens under certain distributional assumptions on the covariates. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions on the bounded covariate. We also provide confirmative answers to certain open questions or address particular concerns in the literature of training over-parameterized neural networks by GD with early stopping for nonparametric regression, including the characterization of the stopping time, the lower bound for the network width, and the constant learning rate used in GD.

neural network, probability, theorem vi, (15 more...)

arXiv.org Machine Learning

2411.02904

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Austria (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(5 more...)

Genre: Research Report > New Finding (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.70)