AITopics | gauss-newton

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Israel (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsFeb-13-2026, 08:19:09 GMT

6a14c7f9fb3f42645cfa6bd5aa446819-Paper-Conference.pdf

artificial intelligence, machine learning, regime, (17 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

arXiv.org Artificial IntelligenceNov-27-2025

Deceptron: Learned Local Inverses for Fast and Stable Physics Inversion

Kachhadiya, Aaditya L.

Inverse problems in the physical sciences are often ill-conditioned in input space, making progress step-size sensitive. We propose the Deceptron, a lightweight bidirectional module that learns a local inverse of a differentiable forward surrogate. Training combines a supervised fit, forward-reverse consistency, a lightweight spectral penalty, a soft bias tie, and a Jacobian Composition Penalty (JCP) that encourages $J_g(f(x))\,J_f(x)\!\approx\!I$ via JVP/VJP probes. At solve time, D-IPG (Deceptron Inverse-Preconditioned Gradient) takes a descent step in output space, pulls it back through $g$, and projects under the same backtracking and stopping rules as baselines. On Heat-1D initial-condition recovery and a Damped Oscillator inverse problem, D-IPG reaches a fixed normalized tolerance with $\sim$20$\times$ fewer iterations on Heat and $\sim$2-3$\times$ fewer on Oscillator than projected gradient, competitive in iterations and cost with Gauss-Newton. Diagnostics show JCP reduces a measured composition error and tracks iteration gains. We also preview a single-scale 2D instantiation, DeceptronNet (v0), that learns few-step corrections under a strict fairness protocol and exhibits notably fast convergence.

artificial intelligence, iteration, machine learning, (17 more...)

2511.21076

Country: Asia > India (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

arXiv.org Machine LearningNov-13-2025

Non-Asymptotic Optimization and Generalization Bounds for Stochastic Gauss-Newton in Overparameterized Models

Cayci, Semih

An important question in deep learning is how higher-order optimization methods affect generalization. In this work, we analyze a stochastic Gauss-Newton (SGN) method with Levenberg-Marquardt damping and mini-batch sampling for training overparameterized deep neural networks with smooth activations in a regression setting. Our theoretical contributions are twofold. First, we establish finite-time convergence bounds via a variable-metric analysis in parameter space, with explicit dependencies on the batch size, network width and depth. Second, we derive non-asymptotic generalization bounds for SGN using uniform stability in the overparameterized regime, characterizing the impact of curvature, batch size, and overparameterization on generalization performance. Our theoretical results identify a favorable generalization regime for SGN in which a larger minimum eigenvalue of the Gauss-Newton matrix along the optimization path yields tighter stability bounds.

artificial intelligence, machine learning, nullnull null, (15 more...)

arXiv.org Machine Learning

2511.03972

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Heinlein, Alexander, Kapoor, Taniya

Domain decomposition architectures and Gauss-Newton training for physics-informed neural networks

arXiv.org Artificial IntelligenceNov-3-2025

Approximating the solutions of boundary value problems governed by partial differential equations with neural networks is challenging, largely due to the difficult training process. This difficulty can be partly explained by the spectral bias, that is, the slower convergence of high-frequency components, and can be mitigated by localizing neural networks via (overlapping) domain decomposition. We combine this localization with the Gauss-Newton method as the optimizer to obtain faster convergence than gradient-based schemes such as Adam; this comes at the cost of solving an ill-conditioned linear system in each iteration. Domain decomposition induces a block-sparse structure in the otherwise dense Gauss-Newton system, reducing the computational cost per iteration. Our numerical results indicate that combining localization and Gauss-Newton optimization is promising for neural network-based solvers for partial differential equations.

artificial intelligence, machine learning, neural network, (12 more...)

2510.27018

Country: Europe > Netherlands > South Holland > Delft (0.05)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceOct-16-2025

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

Liu, Bingbin, Bansal, Rachit, Morwani, Depen, Vyas, Nikhil, Alvarez-Melis, David, Kakade, Sham M.

Diagonal preconditioners are computationally feasible approximate to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and Gauss-Newton (GN) methods: the former leverages statistics of current gradients and is the de-factor optimizers for neural networks, and the latter uses the diagonal elements of the Gauss-Newton matrix and underpins some of the recent diagonal optimizers such as Sophia. In this work, we compare these two diagonal preconditioning methods through the lens of two key factors: the choice of basis in the preconditioner, and the impact of gradient noise from mini-batching. To gain insights, we analyze these optimizers on quadratic objectives and logistic regression under all four quadrants. We show that regardless of the basis, there exist instances where Adam outperforms both GN$^{-1}$ and GN$^{-1/2}$ in full-batch settings. Conversely, in the stochastic regime, Adam behaves similarly to GN$^{-1/2}$ for linear regression under a Gaussian data assumption. These theoretical results are supported by empirical studies on both convex and non-convex objectives.

artificial intelligence, eigenbasis, machine learning, (18 more...)

2510.1368

Country:

Europe > Austria > Vienna (0.14)
North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(11 more...)

Genre: Research Report > New Finding (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceOct-13-2025

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, Natalie, Vyas, Nikhil, Kakade, Sham, Morwani, Depen

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle. With rising compute requirements for training large language models (LLMs), improving optimization methods has become a central strategy for improving training efficiency. Better optimizers can directly reduce the serial runtime to train an LLM, which is crucial for large-scale models that train from days to months. Optimization for LLMs has traditionally leveraged first-order methods such as SGD and Adam (Kingma & Ba, 2017).

large language model, machine learning, natural language, (22 more...)

2510.09378

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > Wisconsin > Dane County > Madison (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Neural Information Processing SystemsOct-10-2025, 21:09:29 GMT

Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization

However, the generalization properties of second-order methods are still being debated.

dataset, gauss-newton, neural network, (15 more...)

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Middle East > Israel (0.04)

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing SystemsOct-8-2025, 20:29:10 GMT

Rethinking Gauss-Newton for learning over-parameterized models

This work studies the global convergence and implicit bias of Gauss Newton's (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit exhibiting a faster convergence rate compared to GD due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN's method. While GN is consistently faster than GD in finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with a small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics are able to recover features with good generalization properties despite the model having sub-optimal training and test performances due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.

artificial intelligence, machine learning, regime, (17 more...)

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
North America > United States (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Gonzalez, Xavier, Kozachkov, Leo, Zoltowski, David M., Clarkson, Kenneth L., Linderman, Scott W.

Predictability Enables Parallelization of Nonlinear State Space Models

arXiv.org Machine LearningAug-26-2025

The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) or DeepPCR (arXiv:2309.16318) have shown that evaluating a state space model can be recast as solving a parallelizable optimization problem, and sometimes this approach can yield dramatic speed-ups in evaluation time. However, the factors that govern the difficulty of these optimization problems remain unclear, limiting the larger adoption of the technique. In this work, we establish a precise relationship between the dynamics of a nonlinear system and the conditioning of its corresponding optimization formulation. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior, impacts the number of optimization steps required for evaluation. In predictable systems, the state trajectory can be computed in $O((\log T)^2)$ time, where $T$ is the sequence length, a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis demonstrates that for predictable systems, the optimization problem is always well-conditioned, whereas for unpredictable systems, the conditioning degrades exponentially as a function of the sequence length. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized, and highlighting predictability as a key design principle for parallelizable models.

artificial intelligence, convergence, optimization problem, (16 more...)

arXiv.org Machine Learning

2508.16817

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States > California > Santa Clara County > Stanford (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(3 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)