AITopics

2512.03225

Country:

North America > United States > New York (0.04)
Europe > United Kingdom > England > Bristol (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
Europe > Austria (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Kamber, Anil, Parhi, Rahul

Sharpness of Minima in Deep Matrix Factorization: Exact Expressions

arXiv.org Machine LearningDec-4-2025

Understanding the geometry of the loss landscape near a minimum is key to explaining the implicit bias of gradient-based methods in non-convex optimization problems such as deep neural network training and deep matrix factorization. A central quantity to characterize this geometry is the maximum eigenvalue of the Hessian of the loss, which measures the sharpness of the landscape. Currently, its precise role has been obfuscated because no exact expressions for this sharpness measure were known in general settings. In this paper, we present the first exact expression for the maximum eigenvalue of the Hessian of the squared-error loss at any minimizer in general overparameterized deep matrix factorization (i.e., deep linear neural network training) problems, resolving an open question posed by Mulayoff & Michaeli (2020). This expression uncovers a fundamental property of the loss landscape of depth-2 matrix factorization problems: a minimum is flat if and only if it is spectral-norm balanced, which implies that flat minima are not necessarily Frobenius-norm balanced. Furthermore, to complement our theory, we empirically investigate an escape phenomenon observed during gradient-based training near a minimum that crucially relies on our exact expression of the sharpness. Decades of research in learning theory suggest limiting model complexity to prevent overfitting.

eigenvalue, factorization, matrix factorization, (13 more...)

2509.25783

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Bell, Zoë Ruha, Thudi, Anvith, Franzese-McLaughlin, Olive, Papernot, Nicolas, Goldwasser, Shafi

Efficient Public Verification of Private ML via Regularization

arXiv.org Artificial IntelligenceDec-4-2025

Training with differential privacy (DP) provides a guarantee to members in a dataset that they cannot be identified by users of the released model. However, those data providers, and, in general, the public, lack methods to efficiently verify that models trained on their data satisfy DP guarantees. The amount of compute needed to verify DP guarantees for current algorithms scales with the amount of compute required to train the model. In this paper we design the first DP algorithm with near optimal privacy-utility trade-offs but whose DP guarantees can be verified cheaper than training. We focus on DP stochastic convex optimization (DP-SCO), where optimal privacy-utility trade-offs are known. Here we show we can obtain tight privacy-utility trade-offs by privately minimizing a series of regularized objectives and only using the standard DP composition bound. Crucially, this method can be verified with much less compute than training. This leads to the first known DP-SCO algorithm with near optimal privacy-utility whose DP verification scales better than training cost, significantly reducing verification costs on large datasets.

algorithm, artificial intelligence, machine learning, (18 more...)

2512.04008

Country:

North America > Canada > Ontario (0.28)
North America > United States > California (0.28)

Genre: Research Report (0.83)

Industry:

Information Technology > Security & Privacy (1.00)
Government (0.67)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

arXiv.org Artificial IntelligenceDec-3-2025

Retrieval-Augmented Memory for Online Learning

Du, Wenzhang

Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.

artificial intelligence, baseline, machine learning, (17 more...)

2512.02333

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.34)

Industry:

Education > Educational Setting > Online (1.00)
Government > Regional Government > North America Government > United States Government (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Qiao, Dan, Wang, Yu-Xiang

Does Flatness imply Generalization for Logistic Loss in Univariate Two-Layer ReLU Network?

arXiv.org Machine LearningDec-2-2025

We consider the problem of generalization of arbitrarily overparameterized two-layer ReLU Neural Networks with univariate input. Recent work showed that under square loss, flat solutions (motivated by flat / stable minima and Edge of Stability phenomenon) provably cannot overfit, but it remains unclear whether the same phenomenon holds for logistic loss. This is a puzzling open problem because existing work on logistic loss shows that gradient descent with increasing step size converges to interpolating solutions (at infinity, for the margin-separable cases). In this paper, we prove that the \emph{flatness implied generalization} is more delicate under logistic loss. On the positive side, we show that flat solutions enjoy near-optimal generalization bounds within a region between the left-most and right-most \emph{uncertain} sets determined by each candidate solution. On the negative side, we show that there exist arbitrarily flat yet overfitting solutions at infinity that are (falsely) certain everywhere, thus certifying that flatness alone is insufficient for generalization in general. We demonstrate the effects predicted by our theory in a well-controlled simulation study.

converge, generalization, theorem 3, (13 more...)

2512.01473

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Patel, Deep, Vlatakis-Gkaragkounis, Emmanouil-Vasileios

Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics

arXiv.org Machine LearningDec-2-2025

Many emerging applications - such as adversarial training, AI alignment, and robust optimization - can be framed as zero-sum games between neural nets, with von Neumann-Nash equilibria (NE) capturing the desirable system behavior. While such games often involve non-convex non-concave objectives, empirical evidence shows that simple gradient methods frequently converge, suggesting a hidden geometric structure. In this paper, we provide a theoretical framework that explains this phenomenon through the lens of hidden convexity and overparameterization. We identify sufficient conditions - spanning initialization, training dynamics, and network width - that guarantee global convergence to a NE in a broad class of non-convex min-max games. To our knowledge, this is the first such result for games that involve two-layer neural networks. Technically, our approach is twofold: (a) we derive a novel path-length bound for the alternating gradient descent-ascent scheme in min-max games; and (b) we show that the reduction from a hidden convex-concave geometry to two-sided Polyak-Łojasiewicz (PŁ) min-max condition hold with high probability under overparameterization, using tools from random matrix theory.

neural network, optimization, overparameterization, (16 more...)

2512.00389

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Texas (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Leisure & Entertainment > Games (1.00)
Government (1.00)
Information Technology (0.92)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
(2 more...)

Etuk, Stella, Matrawy, Ashraf

SA-ADP: Sensitivity-Aware Adaptive Differential Privacy for Large Language Models

Despite advances in the use of large language models (LLMs) in downstream tasks, their ability to memorize information has raised privacy concerns. Therefore, protecting personally identifiable information (PII) during LLM training remains a fundamental challenge. Conventional methods like Differential Privacy-Stochastic Gradient Descent (DP-SGD) provide robust privacy protection via uniform noising, protecting PII regardless of its distinct sensitivity. This comes at the expense of the model's utility, leading to a trade-off. In this paper, we propose SA-ADP, a sensitivity-aware approach that allocates noise based on the sensitivity of individual PII. We evaluated our method on four datasets (ABCD, CUSTOMERSIM, Wikitext-2, and UNSW-NB15 ). Our results show that SA-ADP achieves results comparable to the baseline (No-DP) and the conventional DP-SGD. This means that our method did not degrade the model's utility while still maintaining strong privacy protection.

large language model, machine learning, natural language, (17 more...)

2512.01748

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

On the Tension Between Optimality and Adversarial Robustness in Policy Optimization

Li, Haoran, Lv, Jiayu, Han, Congying, Zhang, Zicheng, Li, Anqi, Liu, Yan, Guo, Tiande, Jiang, Nan

Achieving optimality and adversarial robustness in deep reinforcement learning has long been regarded as conflicting goals. Nonetheless, recent theoretical insights presented in CAR suggest a potential alignment, raising the important question of how to realize this in practice. This paper first identifies a key gap between theory and practice by comparing standard policy optimization (SPO) and adversarially robust policy optimization (ARPO). Although they share theoretical consistency, a fundamental tension between robustness and optimality arises in practical policy gradient methods. SPO tends toward convergence to vulnerable first-order stationary policies (FOSPs) with strong natural performance, whereas ARPO typically favors more robust FOSPs at the expense of reduced returns. Furthermore, we attribute this tradeoff to the reshaping effect of the strongest adversary in ARPO, which significantly complicates the global landscape by inducing deceptive sticky FOSPs. This improves robustness but makes navigation more challenging. To alleviate this, we develop the BARPO, a bilevel framework unifying SPO and ARPO by modulating adversary strength, thereby facilitating navigability while preserving global optima. Extensive empirical results demonstrate that BARPO consistently outperforms vanilla ARPO, providing a practical approach to reconcile theoretical and empirical performance.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2512.01228

Country: North America > United States > Illinois (0.27)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (0.93)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance

Yadav, Robin, Xie, Shuo, Wang, Tianhao, Li, Zhiyuan

Adaptive optimization methods (such as Adam) play a major role in LLM pretraining, significantly outperforming Gradient Descent (GD). Recent studies have proposed new smoothness assumptions on the loss function to explain the advantages of adaptive algorithms with structured preconditioners, e.g., coordinate-wise or layer-wise, and steepest descent methods w.r.t. non-euclidean norms, e.g., $\ell_\infty$ norm or spectral norm, over GD. However, it remains unclear how these smoothness assumptions manifest in language modelling tasks. In this work, we aim to analyze the benefit of $\ell_\infty$-norm descent (a.k.a. sign descent) directly from properties of the data distribution, namely, heavy-tailed class imbalance. We propose a minimal yet representative setting of next-token prediction, where we can provably show faster convergence of coordinate-wise algorithms such as Sign descent (steepest descent w.r.t. $\ell_\infty$ norm) over normalized GD (steepest descent w.r.t. to $\ell_2$ norm) in the presence of heavy tail class imbalance.

large language model, machine learning, natural language, (16 more...)

2512.00763

Country:

North America > United States (0.46)
Europe > Austria (0.28)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

Liu, Jing, Koike-Akino, Toshiaki, Wang, Ye, Mansour, Hassan, Brand, Matthew

To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.

large language model, machine learning, quantization, (15 more...)

2506.10205

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)