AITopics | Optimization

Collaborating Authors

Optimization

News Overviews Instructional Materials AI-Alerts Classics

Convergence Bound and Critical Batch Size of Muon Optimizer

Sato, Naoki, Naganuma, Hiroki, Iiduka, Hideaki

arXiv.org Artificial IntelligenceNov-24-2025

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis identifies the hyperparameters governing this value, and our experiments validate the corresponding theoretical findings across workloads including image classification and language modeling task.

artificial intelligence, batch size, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.01598

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)

Add feedback

Soft decision trees for survival analysis

Consolo, Antonio, Amaldi, Edoardo, Carrizosa, Emilio

arXiv.org Artificial IntelligenceNov-24-2025

Decision trees are popular in survival analysis for their interpretability and ability to model complex relationships. Survival trees, which predict the timing of singular events using censored historical data, are typically built through heuristic approaches. Recently, there has been growing interest in globally optimized trees, where the overall tree is trained by minimizing the error function over all its parameters. We propose a new soft survival tree model (SST), with a soft splitting rule at each branch node, trained via a nonlinear optimization formulation amenable to decomposition. Since SSTs provide for every input vector a specific survival function associated to a single leaf node, they satisfy the conditional computation property and inherit the related benefits. SST and the training formulation combine flexibility with interpretability: any smooth survival function (parametric, semiparametric, or nonparametric) estimated through maximum likelihood can be used, and each leaf node of an SST yields a cluster of distinct survival functions which are associated to the data points routed to it. Numerical experiments on 15 well-known datasets show that SSTs, with parametric and spline-based semiparametric survival functions, trained using an adaptation of the node-based decomposition algorithm proposed by Consolo et al. (2024) for soft regression trees, outperform three benchmark survival trees in terms of four widely-used discrimination and calibration measures. SSTs can also be extended to consider group fairness.

artificial intelligence, machine learning, optimization problem, (19 more...)

arXiv.org Artificial Intelligence

2506.16846

Country:

Europe (1.00)
North America > United States (0.67)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Banking & Finance (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
(3 more...)

Add feedback

Efficient Penalty-Based Bilevel Methods: Improved Analysis, Novel Updates, and Flatness Condition

Jiang, Liuyuan, Xiao, Quan, Chen, Lisha, Chen, Tianyi

arXiv.org Machine LearningNov-24-2025

Penalty-based methods have become popular for solving bilevel optimization (BLO) problems, thanks to their effective first-order nature. However, they often require inner-loop iterations to solve the lower-level (LL) problem and small outer-loop step sizes to handle the increased smoothness induced by large penalty terms, leading to suboptimal complexity. This work considers the general BLO problems with coupled constraints (CCs) and leverages a novel penalty reformulation that decouples the upper- and lower-level variables. This yields an improved analysis of the smoothness constant, enabling larger step sizes and reduced iteration complexity for Penalty-Based Gradient Descent algorithms in ALTernating fashion (ALT-PBGD). Building on the insight of reduced smoothness, we propose PBGD-Free, a novel fully single-loop algorithm that avoids inner loops for the uncoupled constraint BLO. For BLO with CCs, PBGD-Free employs an efficient inner-loop with substantially reduced iteration complexity. Furthermore, we propose a novel curvature condition describing the "flatness" of the upper-level objective with respect to the LL variable. This condition relaxes the traditional upper-level Lipschitz requirement, enables smaller penalty constant choices, and results in a negligible penalty gradient term during upper-level variable updates. We provide rigorous convergence analysis and validate the method's efficacy through hyperparameter optimization for support vector machines and fine-tuning of large language models.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Machine Learning

2511.16796

Country: North America > United States > New York (0.28)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.67)
(2 more...)

Add feedback

Learning Chordal Markov Networks via Branch and Bound

Neural Information Processing SystemsNov-21-2025, 16:11:23 GMT

We present a new algorithmic approach for the task of finding a chordal Markov network structure that maximizes a given scoring function. The algorithm is based on branch and bound and integrates dynamic programming for both domain pruning and for obtaining strong bounds for search-space pruning. Empirically, we show that the approach dominates in terms of running times a recent integer programming approach (and thereby also a recent constraint optimization approach) for the problem.

branch and bound, learning chordal markov network, name change, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.66)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.66)

Add feedback

Learning Combinatorial Optimization Algorithms over Graphs

Neural Information Processing SystemsNov-21-2025, 16:07:22 GMT

The design of good heuristics or approximation algorithms for NP-hard combinatorial optimization problems often requires significant specialized knowledge and trial-and-error. Can we automate this challenging, tedious process, and learn the algorithms instead? In many real-world applications, it is typically the case that the same optimization problem is solved again and again on a regular basis, maintaining the same problem structure but differing in the data. This provides an opportunity for learning heuristic algorithms that exploit the structure of such recurring problems. In this paper, we propose a unique combination of reinforcement learning and graph embedding to address this challenge. The learned greedy policy behaves like a meta-algorithm that incrementally constructs a solution, and the action is determined by the output of a graph embedding network capturing the current state of the solution. We show that our framework can be applied to a diverse range of optimization problems over graphs, and learns effective algorithms for the Minimum Vertex Cover, Maximum Cut and Traveling Salesman problems.

electronic proceedings, learning combinatorial optimization algorithm, name change, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

Add feedback

Integration Methods and Optimization Algorithms

Neural Information Processing SystemsNov-21-2025, 16:01:36 GMT

We show that accelerated optimization methods can be seen as particular instances of multi-step integration schemes from numerical analysis, applied to the gradient flow equation. Compared with recent advances in this vein, the differential equation considered here is the basic gradient flow, and we derive a class of multi-step schemes which includes accelerated algorithms, using classical conditions from numerical analysis. Multi-step schemes integrate the differential equation using larger step sizes, which intuitively explains the acceleration phenomenon.

electronic proceedings, integration method and optimization algorithm, name change, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.44)

Add feedback

Kernel Feature Selection via Conditional Covariance Minimization

Neural Information Processing SystemsNov-21-2025, 15:58:53 GMT

We propose a method for feature selection that employs kernel-based measures of independence to find a subset of covariates that is maximally predictive of the response. Building on past work in kernel dimension reduction, we show how to perform feature selection via a constrained optimization problem involving the trace of the conditional covariance operator. We prove various consistency results for this procedure, and also demonstrate that our method compares favorably with other state-of-the-art algorithms on a variety of synthetic and real data sets.

conditional covariance minimization, kernel feature selection, name change, (1 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Jordan (0.09)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.63)

Add feedback

Scalable Planning with Tensorflow for Hybrid Nonlinear Domains

Neural Information Processing SystemsNov-21-2025, 15:51:20 GMT

Given recent deep learning results that demonstrate the ability to effectively optimize high-dimensional non-convex functions with gradient descent optimization on GPUs, we ask in this paper whether symbolic gradient optimization tools such as Tensorflow can be effective for planning in hybrid (mixed discrete and continuous) nonlinear domains with high dimensional state and action spaces? To this end, we demonstrate that hybrid planning with Tensorflow and RMSProp gradient descent is competitive with mixed integer linear program (MILP) based optimization on piecewise linear planning domains (where we can compute optimal solutions) and substantially outperforms state-of-the-art interior point methods for nonlinear planning domains. Furthermore, we remark that Tensorflow is highly scalable, converging to a strong plan on a large-scale concurrent domain with a total of 576,000 continuous action parameters distributed over a horizon of 96 time steps and 100 parallel instances in only 4 minutes. We provide a number of insights that clarify such strong performance including observations that despite long horizons, RMSProp avoids both the vanishing and exploding gradient problems. Together these results suggest a new frontier for highly scalable planning in nonlinear hybrid domains by leveraging GPUs and the power of recent advances in gradient descent with highly optimized toolkits like Tensorflow.

name change, scalable planning, tensorflow, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.60)

Add feedback

Inference in Graphical Models via Semidefinite Programming Hierarchies

Neural Information Processing SystemsNov-21-2025, 15:48:35 GMT

Popular inference algorithms such as belief propagation (BP) and generalized belief propagation (GBP) are intimately related to linear programming (LP) relaxation within the Sherali-Adams hierarchy. Despite the popularity of these algorithms, it is well understood that the Sum-of-Squares (SOS) hierarchy based on semidefinite programming (SDP) can provide superior guarantees. Unfortunately, SOS relaxations for a graph with $n$ vertices require solving an SDP with $n^{\Theta(d)}$ variables where $d$ is the degree in the hierarchy. In practice, for $d\ge 4$, this approach does not scale beyond a few tens of variables. In this paper, we propose binary SDP relaxations for MAP inference using the SOS hierarchy with two innovations focused on computational efficiency. Firstly, in analogy to BP and its variants, we only introduce decision variables corresponding to contiguous regions in the graphical model. Secondly, we solve the resulting SDP using a non-convex Burer-Monteiro style method, and develop a sequential rounding procedure. We demonstrate that the resulting algorithm can solve problems with tens of thousands of variables within minutes, and outperforms BP and GBP on practical problems such as image denoising and Ising spin glasses. Finally, for specific graph types, we establish a sufficient condition for the tightness of the proposed partial SOS relaxation.

graphical model, name change, semidefinite programming hierarchy, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.39)

Add feedback

Online Dynamic Programming

Neural Information Processing SystemsNov-21-2025, 15:42:28 GMT

We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. An instance of the type of problems we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the n keys at the internal nodes and the n + 1 gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged by the average search cost for the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge and Component Hedge to a significantly wider class of combinatorial objects than was possible before.

algorithm, name change, online dynamic programming, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.90)
Information Technology > Artificial Intelligence > Machine Learning (0.76)

Add feedback