 Gradient Descent


Analysis of human visual field information using machine learning methods and assessment of their accuracy

arXiv.org Artificial Intelligence

Subject of research: methods for analyzing perimetric images for the diagnosis and monitoring of glaucoma. Object of research: a dataset collected on an ophthalmological perimeter, containing results for patients with various pathologies; the ophthalmological community is acutely aware of the issues of disease control and import substitution [5]. Purpose of research: to consider various machine learning methods that can classify glaucoma. This is made possible by a classifier built after labeling the dataset, which determines from an image whether the visual fields it depicts result from glaucoma or from other visual diseases. An earlier work [3] described the dataset, which was collected on a Tomey perimeter. The examined patients ranged in age from 30 to 85 years. Methods of research: machine learning methods for classifying image results (stochastic gradient descent, logistic regression, random forest, naive Bayes). Main results of research: a computer model that determines from an image whether the result indicates glaucoma or another disease (binary classification).
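
As a rough illustration of one of the listed methods, the sketch below trains a logistic regression classifier with stochastic gradient descent on numeric feature vectors. The function names, the toy one-dimensional data, and the hyperparameters are assumptions made for this example; the abstract does not describe the authors' features, preprocessing, or implementation.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(features, labels, lr=0.5, epochs=100):
    """Fit weights and bias with per-example gradient steps on the log loss."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Return 1 for the positive class (hypothetically, glaucoma), else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy, made-up data: one feature per example, label 1 for the positive class.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
w, b = train_logistic_sgd(X, y)
```

In practice each perimetric image would first have to be turned into a feature vector, a step this sketch leaves out.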


High-entropy Advantage in Neural Networks' Generalizability

arXiv.org Artificial Intelligence

While the 2024 Nobel Prize in Physics ignites a worldwide discussion on the origins of neural networks and their foundational links to physics, modern machine learning research predominantly focuses on computational and algorithmic advancements, overlooking the physical picture. Here we introduce the concept of entropy into neural networks by reconceptualizing them as hypothetical physical systems where each parameter is a non-interacting 'particle' within a one-dimensional space. By employing the Wang-Landau algorithm, we construct the entropy landscapes of neural networks (with up to 1 million parameters) as functions of training loss and test accuracy (or loss) across four distinct machine learning tasks: arithmetic questions, real-world tabular data, image recognition, and language modeling. Our results reveal the existence of an entropy advantage, where the high-entropy states generally outperform the states reached via classical training optimizers such as stochastic gradient descent. We also find that this advantage is more pronounced in narrower networks, indicating a need for different training optimizers tailored to different sizes of neural networks.


Quantum EigenGame for excited state calculation

arXiv.org Artificial Intelligence

Quantum computing offers an alternative approach to solving complex computational tasks, potentially reducing the time and space complexity compared to classical methods. Quantum algorithms such as Quantum Phase Estimation [1], the Deutsch-Jozsa algorithm [2], and Grover's algorithm [3] demonstrate superior performance in ideal, noiseless conditions. However, in the Noisy Intermediate-Scale Quantum (NISQ) era [4], noise remains a significant challenge, influencing the stability and reliability of quantum computations [5-8]. Performing optimization tasks under noisy settings is a common scenario in the algorithmic literature. In optimization and machine learning, errors that propagate throughout iterations critically influence performance metrics and outcomes [9-12]. Understanding and mitigating error propagation is crucial for enhancing the practical utility of algorithms in real-world applications. Particularly relevant to the present work is derivative-free optimization (DFO) [13-18], which is employed effectively in scenarios where traditional gradient-based methods falter [16]. However, the efficiency of DFO methods often lags, particularly for high-dimensional problems, due to their reliance on sampling routines that may require many function evaluations to approximate gradients [15]. Further, DFO may struggle with precision near minima [17].


Effective Dimension Aware Fractional-Order Stochastic Gradient Descent for Convex Optimization Problems

arXiv.org Artificial Intelligence

Fractional-order stochastic gradient descent (FOSGD) leverages a fractional exponent to capture long-memory effects in optimization, yet its practical impact is often constrained by the difficulty of tuning and stabilizing this exponent. In this work, we introduce 2SED Fractional-Order Stochastic Gradient Descent (2SEDFOSGD), a novel method that synergistically combines the Two-Scale Effective Dimension (2SED) algorithm with FOSGD to automatically calibrate the fractional exponent in a data-driven manner. By continuously gauging model sensitivity and effective dimensionality, 2SED dynamically adjusts the exponent to curb erratic oscillations and enhance convergence rates. Theoretically, we demonstrate how this dimension-aware adaptation retains the benefits of fractional memory while averting the sluggish or unstable behaviors frequently observed in naive fractional SGD. Empirical evaluations across multiple benchmarks confirm that our 2SED-driven fractional exponent approach not only converges faster but also achieves more robust final performance, suggesting broad applicability for fractional-order methodologies in large-scale machine learning and related domains.


PAUSE: Low-Latency and Privacy-Aware Active User Selection for Federated Learning

arXiv.org Artificial Intelligence

Federated learning (FL) enables multiple edge devices to collaboratively train a machine learning model without the need to share potentially private data. Federated learning proceeds through iterative exchanges of model updates, which pose two key challenges: first, the accumulation of privacy leakage over time, and second, communication latency. These two limitations are typically addressed separately: the former via perturbed updates to enhance privacy and the latter using user selection to mitigate latency, both at the expense of accuracy. In this work, we propose a method that jointly addresses the accumulation of privacy leakage and communication latency via active user selection, aiming to improve the trade-off among privacy, latency, and model performance. To achieve this, we construct a reward function that accounts for these three objectives. Building on this reward, we propose a multi-armed bandit (MAB)-based algorithm, termed Privacy-aware Active User SElection (PAUSE), which dynamically selects a subset of users each round while ensuring bounded overall privacy leakage. We establish a theoretical analysis, systematically showing that the reward growth rate of PAUSE follows the best-known rate in the MAB literature. To address the complexity overhead of active user selection, we propose a simulated annealing-based relaxation of PAUSE and analyze its ability to approximate the reward-maximizing policy under reduced complexity. We numerically validate the privacy leakage, associated improved latency, and accuracy gains of our methods for federated training in various scenarios.
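
The abstract does not give the PAUSE reward function or its selection rule, so the following is only a generic UCB1-style sketch of how a bandit can pick a subset of users each round. The names `select_users` and `update`, the exploration constant `c`, and the scalar per-user reward are all assumptions for illustration; the privacy budget and latency terms of the actual method are not modeled.

```python
import math


def select_users(counts, mean_rewards, round_index, k, c=2.0):
    """Pick the k users with the highest upper-confidence index.

    counts[u] is how often user u was selected so far, mean_rewards[u] is the
    running average of the (hypothetical) scalar reward observed for u.
    round_index must be >= 1.
    """
    scored = []
    for user, (n, mu) in enumerate(zip(counts, mean_rewards)):
        if n == 0:
            index = float("inf")  # force at least one selection of every user
        else:
            index = mu + math.sqrt(c * math.log(round_index) / n)
        scored.append((index, user))
    # Highest index first; ties are broken by the smaller user id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(user for _, user in scored[:k])


def update(counts, mean_rewards, user, reward):
    """Incrementally update the running mean reward of a selected user."""
    counts[user] += 1
    mean_rewards[user] += (reward - mean_rewards[user]) / counts[user]
```

A server loop would call `select_users` at the start of each round and `update` once the reward for each selected user is known.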


Revisiting Gradient Descent: A Dual-Weight Method for Improved Learning

arXiv.org Artificial Intelligence

In neural networks, the weight vector W of a neuron plays a crucial role in transforming input features into outputs. While representing synaptic weights of postsynaptic neurons from presynaptic neurons, W can also be viewed as the neuron's encoding of the target concept it aims to represent. However, defining a target concept independently from other concepts often results in insufficient representation; rather, effective learning necessitates contrasting the target with non-targets. For instance, to accurately define a "dog," it is essential not only to understand the characteristics of dogs but also to distinguish them from non-dog entities. Without this contrast, differentiation remains incomplete. Similarly, when a neuron learns, it should capture the differences between the features of the target class (hereafter termed positive examples) and those of non-target classes (negative examples).


Beyond Propagation of Chaos: A Stochastic Algorithm for Mean Field Optimization

arXiv.org Machine Learning

Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm's output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.


An Optimization Framework for Differentially Private Sparse Fine-Tuning

arXiv.org Machine Learning

Differentially private stochastic gradient descent (DP-SGD) is broadly considered to be the gold standard for training and fine-tuning neural networks under differential privacy (DP). With the increasing availability of high-quality pre-trained model checkpoints (e.g., vision and language models), fine-tuning has become a popular strategy. However, despite recent progress in understanding and applying DP-SGD for private transfer learning tasks, significant challenges remain, most notably the performance gap between models fine-tuned with DP-SGD and their non-private counterparts. Sparse fine-tuning on private data has emerged as an alternative to full-model fine-tuning; recent work has shown that privately fine-tuning only a small subset of model weights and keeping the rest of the weights fixed can lead to better performance. In this work, we propose a new approach for sparse fine-tuning of neural networks under DP. Existing work on private sparse fine-tuning often uses a fixed choice of trainable weights (e.g., updating only the last layer), or relies on a public model's weights to choose the subset of weights to modify. Such a choice of weights remains suboptimal. In contrast, we explore an optimization-based approach, where our selection method makes use of the private gradient information while using off-the-shelf privacy accounting techniques. Our numerical experiments on several computer vision models and datasets show that our selection method leads to better prediction accuracy compared to full-model private fine-tuning or existing private sparse fine-tuning approaches.
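
For readers unfamiliar with DP-SGD, a minimal sketch of a single update restricted to a subset of weights might look like the following. It shows the standard recipe (clip each per-example gradient, sum, add Gaussian noise, average) applied only to coordinates enabled by a mask. The function name, the `mask` argument, and the fixed mask are assumptions for illustration. The paper's optimization-based selection of the mask is not reproduced here, and no privacy accounting is performed.

```python
import math
import random


def dp_sgd_step(weights, per_example_grads, mask, lr, clip_norm,
                noise_multiplier, rng):
    """One DP-SGD update on the coordinates where mask[i] is True."""
    dim = len(weights)
    clipped_sum = [0.0] * dim
    for grad in per_example_grads:
        # Frozen coordinates are zeroed first, so clipping only sees trained ones.
        g = [gi if m else 0.0 for gi, m in zip(grad, mask)]
        norm = math.sqrt(sum(gi * gi for gi in g))
        scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
        for i in range(dim):
            clipped_sum[i] += scale * g[i]

    sigma = noise_multiplier * clip_norm  # std of the Gaussian noise on the sum
    batch_size = len(per_example_grads)
    new_weights = []
    for i in range(dim):
        if not mask[i]:
            new_weights.append(weights[i])  # frozen weight stays unchanged
            continue
        noisy_mean = (clipped_sum[i] + rng.gauss(0.0, sigma)) / batch_size
        new_weights.append(weights[i] - lr * noisy_mean)
    return new_weights


# Example call with made-up numbers; the last weight is frozen by the mask.
updated = dp_sgd_step(
    weights=[0.0, 0.0, 5.0],
    per_example_grads=[[3.0, 4.0, 100.0], [0.3, 0.4, -7.0]],
    mask=[True, True, False],
    lr=1.0,
    clip_norm=1.0,
    noise_multiplier=1.0,
    rng=random.Random(0),
)
```

Setting `noise_multiplier` to zero turns the step into plain clipped SGD, which is convenient for checking the clipping logic but provides no privacy guarantee.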


In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

arXiv.org Machine Learning

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query (KQ) weights, and a last-entry-only and zero-sum pattern in the output-value (OV) weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor, one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportional factor. Furthermore, compared to linear transformers, the softmax attention readily generalizes to sequences longer than those seen during training. We also extend our study to scenarios with non-isotropic covariates and multi-task linear regression. In the former, multi-head attention learns to implement a form of pre-conditioned gradient descent. In the latter, we uncover an intriguing regime where the interplay between head number and task number triggers a superposition phenomenon that efficiently resolves multi-task in-context learning. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.


Towards Learning High-Precision Least Squares Algorithms with Sequence Models

arXiv.org Artificial Intelligence

This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain 100,000x lower MSE than standard Transformers trained end-to-end and they incur a 10,000x smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
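
For reference, the gradient descent iterates that the models are trained to reproduce can be sketched in a few lines for the objective 0.5 * ||Ax - b||^2, whose gradient is A^T (Ax - b). The function name, the toy matrix, and the step size below are assumptions chosen for this example and are not taken from the paper.

```python
def least_squares_gd(A, b, steps, lr):
    """Run plain gradient descent on 0.5 * ||A x - b||^2, starting from x = 0."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(steps):
        # Residual r = A x - b.
        residual = [
            sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i
            for row, b_i in zip(A, b)
        ]
        # Gradient g = A^T r.
        grad = [
            sum(A[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_cols)
        ]
        x = [x_j - lr * g_j for x_j, g_j in zip(x, grad)]
    return x


# Toy consistent system whose exact solution is x = [1, -2].
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, -4.0, -1.0]
x_hat = least_squares_gd(A, b, steps=200, lr=0.3)
```

The step size must stay below 2 divided by the largest eigenvalue of A^T A for the iteration to converge; for this toy matrix that bound is roughly 0.377, so 0.3 is safe.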