AITopics

2503.14562

Country:

Asia > Russia (0.15)
Europe > Russia > Volga Federal District > Saratov Oblast > Saratov (0.05)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.05)
Asia > Macao (0.05)

Genre: Research Report > New Finding (0.55)

Industry: Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.98)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

High-entropy Advantage in Neural Networks' Generalizability

Yang, Entao, Zhang, Xiaotian, Shang, Yue, Zhang, Ge

While the 2024 Nobel Prize in Physics ignites a worldwide discussion on the origins of neural networks and their foundational links to physics, modern machine learning research predominantly focuses on computational and algorithmic advancements, overlooking a picture of physics. Here we introduce the concept of entropy into neural networks by reconceptualizing them as hypothetical physical systems where each parameter is a non-interacting 'particle' within a one-dimensional space. By employing a Wang-Landau algorithms, we construct the neural networks' (with up to 1 million parameters) entropy landscapes as functions of training loss and test accuracy (or loss) across four distinct machine learning tasks, including arithmetic question, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of \textit{entropy advantage}, where the high-entropy states generally outperform the states reached via classical training optimizer like stochastic gradient descent. We also find this advantage is more pronounced in narrower networks, indicating a need of different training optimizers tailored to different sizes of neural networks.

artificial intelligence, machine learning, neural network, (16 more...)

2503.13145

Country:

North America > United States (0.46)
Asia > China (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (0.68)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (0.68)
Energy > Oil & Gas > Midstream (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Quiroga, David, Han, Jason, Kyrillidis, Anastasios

Quantum EigenGame for excited state calculation

Quantum computing offers an alternative approach to solving complex computational tasks, potentially reducing the time and space complexity compared to classical methods. Quantum algorithms -like Quantum Phase Estimation [1], the Deutsch-Jozsa algorithm [2], and Grover's algorithm [3]- demonstrate superior performance in ideal, noiseless conditions. However, in the Noisy Intermediate-Scale Quantum (NISQ) era [4], noise remains a significant challenge, influencing the stability and reliability of quantum computations [5-8]. Performing optimization tasks under noisy settings is a common scenario in the algorithmic literature. In optimization and machine learning, errors that propagate throughout iterations critically influence performance metrics and outcomes [9-12]. Understanding and mitigating error propagation is crucial for enhancing the practical utility of algorithms in real-world applications. Particularly relevant to the present work, consider the case of derivative-free optimization (DFO) [13-18]: DFO is employed effectively in scenarios where traditional gradient-based methods falter [16]. However, the efficiency of DFO methods often lags, particularly for high-dimensional problems, due to their reliance on sampling routines that may require many function evaluations to approximate gradients [15]. Further, DFO may struggle with precision near minima [17].

artificial intelligence, eigengame, machine learning, (17 more...)

2503.13644

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Virginia (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Partohaghighi, Mohammad, Marcia, Roummel, Chen, YangQuan

Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.

artificial intelligence, fractional exponent, machine learning, (14 more...)

2503.13764

Country:

North America > United States > California > Merced County > Merced (0.14)
North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.04)
North America > United States > Indiana > Hamilton County > Fishers (0.04)

Genre:

Research Report > New Finding (0.68)
Research Report > Promising Solution (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

Peleg, Ori, Lang, Natalie, Rini, Stefano, Shlezinger, Nir, Cohen, Kobi

Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: First, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: The former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency - both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE) which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows that of the best-known rate in MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for the federated training in various scenarios.

artificial intelligence, federated learning, machine learning, (17 more...)

2503.13173

Country:

North America > United States > California > Santa Barbara County > Santa Barbara (0.04)
Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
Asia > Taiwan (0.04)
Asia > Middle East > Israel > Southern District > Beer-Sheva (0.04)

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning

Wang, Xi

In neural networks, the weight vector W of a neuron plays a crucial role in transforming input features into outputs. While representing synaptic weights of postsynaptic neurons from presynaptic neurons, W can also be viewed as the neuron's encoding of the target concept it aims to represent. However, defining a target concept independently from other concepts often results in insufficient representation; rather, effective learning necessitates contrasting the target with non-targets. For instance, to accurately define a "dog," it is essential not only to understand the characteristics of dogs but also to distinguish them from non-dog entities. Without this contrast, differentiation remains incomplete. Similarly, when a neuron learns, it should capture the differences between the features of the target class (hereafter termed positive examples) and those of non-target classes (negative examples).

artificial intelligence, machine learning, neuron, (16 more...)

2503.11965

Country: North America > United States > California (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Tankala, Chandan, Nagaraj, Dheeraj M., Raj, Anant

Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization

arXiv.org Machine LearningMar-17-2025

Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.

artificial intelligence, inequality, machine learning, (16 more...)

arXiv.org Machine Learning

2503.13115

Country:

North America > United States > Oregon > Lane County > Eugene (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

arXiv.org Machine LearningMar-17-2025

An Optimization Framework for Differentially Private Sparse Fine-Tuning

Makni, Mehdi, Behdin, Kayhan, Afriat, Gabriel, Xu, Zheng, Vassilvitskii, Sergei, Ponomareva, Natalia, Hazimeh, Hussein, Mazumder, Rahul

Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain -- most notably, the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights and keeping the rest of the weights fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse finetuning often used fixed choice of trainable weights (e.g., updating only the last layer), or relied on public model's weights to choose the subset of weights to modify. Such choice of weights remains suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information, while using off the shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy, compared to full-model private fine-tuning or existing private sparse fine-tuning approaches.

artificial intelligence, fine-tuning, machine learning, (18 more...)

arXiv.org Machine Learning

2503.12822

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

arXiv.org Machine LearningMar-16-2025

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

He, Jianliang, Pan, Xintian, Chen, Siyu, Yang, Zhuoran

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor -- one that outperforms single-head attention and nearly achieves Bayesian optimality up to proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.

artificial intelligence, attention model, machine learning, (17 more...)

arXiv.org Machine Learning

2503.12734

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

arXiv.org Artificial IntelligenceMar-15-2025

Towards Learning High-Precision Least Squares Algorithms with Sequence Models

Liu, Jerry, Grogan, Jessica, Dugan, Owen, Rao, Ashish, Arora, Simran, Rudra, Atri, Ré, Christopher

This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.

artificial intelligence, deep learning, machine learning, (20 more...)

2503.12295

Country:

North America > United States > California > Santa Clara County > Stanford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > Rhode Island > Providence County > Providence (0.04)
(3 more...)

Genre: Research Report > New Finding (0.45)

Industry:

Information Technology (0.67)
Government > Regional Government > North America Government (0.47)

Technology:

Information Technology > Mathematics of Computing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)