AITopics

2412.09498

Country: North America > United States > California > Los Angeles County > Los Angeles (0.27)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine LearningJan-4-2025

A ghost mechanism: An analytical model of abrupt learning

Dinc, Fatih, Cirakman, Ege, Jiang, Yiqi, Yuksekgonul, Mert, Schnitzer, Mark J., Tanaka, Hidenori

\emph{Abrupt learning} is commonly observed in neural networks, where long plateaus in network performance are followed by rapid convergence to a desirable solution. Yet, despite its common occurrence, the complex interplay of task, network architecture, and learning rule has made it difficult to understand the underlying mechanisms. Here, we introduce a minimal dynamical system trained on a delayed-activation task and demonstrate analytically how even a one-dimensional system can exhibit abrupt learning through ghost points rather than bifurcations. Through our toy model, we show that the emergence of a ghost point destabilizes learning dynamics. We identify a critical learning rate that prevents learning through two distinct loss landscape features: a no-learning zone and an oscillatory minimum. Testing these predictions in recurrent neural networks (RNNs), we confirm that ghost points precede abrupt learning and accompany the destabilization of learning. We demonstrate two complementary remedies: lowering the model output confidence prevents the network from getting stuck in no-learning zones, while increasing trainable ranks beyond task requirements (\textit{i.e.}, adding sloppy parameters) provides more stable learning trajectories. Our model reveals a bifurcation-free mechanism for abrupt learning and illustrates the importance of both deliberate uncertainty and redundancy in stabilizing learning dynamics.

artificial intelligence, machine learning, toy model, (17 more...)

2501.02378

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
North America > United States > New York (0.04)
(3 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Health & Medicine (0.69)
Government > Regional Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

arXiv.org Machine LearningJan-3-2025

Guaranteed Nonconvex Low-Rank Tensor Estimation via Scaled Gradient Descent

Wu, Tong

Tensors, which give a faithful and effective representation to deliver the intrinsic structure of multi-dimensional data, play a crucial role in an increasing number of signal processing and machine learning problems. However, tensor data are often accompanied by arbitrary signal corruptions, including missing entries and sparse noise. A fundamental challenge is to reliably extract the meaningful information from corrupted tensor data in a statistically and computationally efficient manner. This paper develops a scaled gradient descent (ScaledGD) algorithm to directly estimate the tensor factors with tailored spectral initializations under the tensor-tensor product (t-product) and tensor singular value decomposition (t-SVD) framework. In theory, ScaledGD achieves linear convergence at a constant rate that is independent of the condition number of the ground truth low-rank tensor for two canonical problems -- tensor robust principal component analysis and tensor completion -- as long as the level of corruptions is not too large and the sample size is sufficiently large, while maintaining the low per-iteration cost of gradient descent. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties for low-rank tensor estimation with the t-SVD decomposition. Finally, numerical examples are provided to demonstrate the efficacy of ScaledGD in accelerating the convergence rate of ill-conditioned low-rank tensor estimation in these two applications.

artificial intelligence, machine learning, tensor, (18 more...)

2501.01696

Genre: Research Report (0.49)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Roy, Priyanka, Saminger-Platz, Susanne

Upper Bounds for Learning in Reproducing Kernel Hilbert Spaces for Non IID Samples

arXiv.org Machine LearningJan-2-2025

In this paper, we study a Markov chain-based stochastic gradient algorithm in general Hilbert spaces, aiming to approximate the optimal solution of a quadratic loss function. We establish probabilistic upper bounds on its convergence. We further extend these results to an online regularized learning algorithm in reproducing kernel Hilbert spaces, where the samples are drawn along a Markov chain trajectory hence the samples are of the non i.i.d.

algorithm, markov chain, markov chain trajectory, (14 more...)

2410.08361

Country:

North America > United States > New York (0.04)
Europe > Austria > Upper Austria > Linz (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

arXiv.org Artificial IntelligenceJan-1-2025

How to explain grokking

Kozyrev, S. V.

Simple ideas of thermodynamics and kinetic theory allow us to explain better generalization observed for learning by the stochastic gradient optimization procedure, see also [7] (where also overfitting control for GAN model was discussed). We also have explained the grokking(delayed generalization) phenomenon and some properties of grokking observed in [8].

entropy, free energy, training sample, (16 more...)

2412.18624

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Chae, Jiseok, Yun, Chulhee, Kim, Donghwan

Stochastic Extragradient with Flip-Flop Shuffling & Anchoring: Provable Improvements

arXiv.org Artificial IntelligenceDec-31-2024

In minimax optimization, the extragradient (EG) method has been extensively studied because it outperforms the gradient descent-ascent method in convex-concave (C-C) problems. Yet, stochastic EG (SEG) has seen limited success in C-C problems, especially for unconstrained cases. Motivated by the recent progress of shuffling-based stochastic methods, we investigate the convergence of shuffling-based SEG in unconstrained finite-sum minimax problems, in search of convergent shuffling-based SEG. Our analysis reveals that both random reshuffling and the recently proposed flip-flop shuffling alone can suffer divergence in C-C problems. However, with an additional simple trick called anchoring, we develop the SEG with flip-flop anchoring (SEG-FFA) method which successfully converges in C-C problems. We also show upper and lower bounds in the strongly-convex-strongly-concave setting, demonstrating that SEG-FFA has a provably faster convergence rate compared to other shuffling-based methods.

artificial intelligence, fz 0, machine learning, (15 more...)

2501.00511

Country: Asia (0.27)

Genre: Research Report > Experimental Study (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Wild, Romina, Wodaczek, Felix, Del Tatto, Vittorio, Cheng, Bingqing, Laio, Alessandro

Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance

arXiv.org Machine LearningDec-30-2024

Feature selection is essential in the analysis of molecular systems and many other fields, but several uncertainties remain: What is the optimal number of features for a simplified, interpretable model that retains essential information? How should features with different units be aligned, and how should their relative importance be weighted? Here, we introduce the Differentiable Information Imbalance (DII), an automated method to rank information content between sets of features. Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships. Each feature is scaled by a weight, which is optimized by minimizing the DII through gradient descent. This allows simultaneously performing unit alignment and relative importance scaling, while preserving interpretability. DII can also produce sparse solutions and determine the optimal size of the reduced feature space. We demonstrate the usefulness of this approach on two benchmark molecular problems: (1) identifying collective variables that describe conformations of a biomolecule, and (2) selecting features for training a machine-learning force field. These results show the potential of DII in addressing feature selection challenges and optimizing dimensionality in various applications. The method is available in the Python library DADApy.

dii, feature selection, ground truth, (12 more...)

2411.00851

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > California > Alameda County > Berkeley (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
(6 more...)

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Mahmoudi, Afsaneh, Xiao, Ming, Björnson, Emil

Accelerating Energy-Efficient Federated Learning in Cell-Free Networks with Adaptive Quantization

arXiv.org Artificial IntelligenceDec-30-2024

Federated Learning (FL) enables clients to share learning parameters instead of local data, reducing communication overhead. Traditional wireless networks face latency challenges with FL. In contrast, Cell-Free Massive MIMO (CFmMIMO) can serve multiple clients on shared resources, boosting spectral efficiency and reducing latency for large-scale FL. However, clients' communication resource limitations can hinder the completion of the FL training. To address this challenge, we propose an energy-efficient, low-latency FL framework featuring optimized uplink power allocation for seamless client-server collaboration. Our framework employs an adaptive quantization scheme, dynamically adjusting bit allocation for local gradient updates to reduce communication costs. We formulate a joint optimization problem covering FL model updates, local iterations, and power allocation, solved using sequential quadratic programming (SQP) to balance energy and latency. Additionally, clients use the AdaDelta method for local FL model updates, enhancing local model convergence compared to standard SGD, and we provide a comprehensive analysis of FL convergence with AdaDelta local updates. Numerical results show that, within the same energy and latency budgets, our power allocation scheme outperforms the Dinkelbach and max-sum rate methods by increasing the test accuracy up to $7$\% and $19$\%, respectively. Moreover, for the three power allocation methods, our proposed quantization scheme outperforms AQUILA and LAQ by increasing test accuracy by up to $36$\% and $35$\%, respectively.

artificial intelligence, iteration, machine learning, (15 more...)

2412.20785

Country: Europe (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Refael, Yehonathan, Svirsky, Jonathan, Shustin, Boris, Huleihel, Wasim, Lindenbaum, Ofir

AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning

arXiv.org Artificial IntelligenceDec-29-2024

Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements due to the increasing size of the model weights and the optimizer states. Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA), which involves introducing a parallel trainable low-rank matrix to the fixed pre-trained weights at each layer. However, these methods often fall short compared to the full-rank weight training approach, as they restrict the parameter search to a low-rank subspace. This limitation can disrupt training dynamics and require a full-rank warm start to mitigate the impact. In this paper, we introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated layer gradients gradually decreases, and asymptotically approaches rank one. Leveraging this, our approach involves adaptively reducing the rank of the gradients during Adam optimization steps, using an efficient online-updating low-rank projections rule. We further present a randomized SVD scheme for efficiently finding the projection matrix. Our technique enables full-parameter fine-tuning with adaptive low-rank gradient updates, significantly reducing overall memory requirements during training compared to state-of-the-art methods while improving model performance in both pretraining and fine-tuning. Finally, we provide a convergence analysis of our method and demonstrate its merits for training and fine-tuning language and biological foundation models.

large language model, machine learning, natural language, (18 more...)

2410.17881

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.34)

Industry: Health & Medicine (0.87)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Chaubard, Francois, Eddy, Duncan, Kochenderfer, Mykel J.

Beyond Gradient Averaging in Parallel Optimization: Improved Robustness through Gradient Agreement Filtering

arXiv.org Artificial IntelligenceDec-29-2024

We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to calculate a macrobatch gradient that is then used to update model parameters. We find that gradients across microbatches are often orthogonal or negatively correlated, especially in late stages of training, which leads to memorization of the training set, reducing generalization. In this paper, we introduce a simple, computationally effective way to reduce gradient variance by computing the cosine distance between micro-gradients during training and filtering out conflicting updates prior to averaging. We improve validation accuracy with significantly smaller microbatch sizes. We also show this reduces memorizing noisy labels. We demonstrate the effectiveness of this technique on standard image classification benchmarks including CIFAR-100 and CIFAR-100N-Fine. We show this technique consistently outperforms validation accuracy, in some cases by up to 18.2\% compared to traditional training approaches while reducing the computation required nearly an order of magnitude because we can now rely on smaller microbatch sizes without destabilizing training.

artificial intelligence, cosine distance, machine learning, (15 more...)

2412.18052

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Russia (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)