Optimization
Primal-Dual Neural Algorithmic Reasoning
Neural Algorithmic Reasoning (NAR) trains neural networks to simulate classical algorithms, enabling structured and interpretable reasoning over complex data. While prior research has predominantly focused on learning exact algorithms for polynomial-time-solvable problems, extending NAR to harder problems remains an open challenge. In this work, we introduce a general NAR framework grounded in the primal-dual paradigm, a classical method for designing efficient approximation algorithms. By leveraging a bipartite representation between primal and dual variables, we establish an alignment between primal-dual algorithms and Graph Neural Networks. Furthermore, we incorporate optimal solutions from small instances to greatly enhance the model's reasoning capabilities. Our empirical results demonstrate that our model not only simulates but also outperforms approximation algorithms for multiple tasks, exhibiting robust generalization to larger and out-of-distribution graphs. Moreover, we highlight the framework's practical utility by integrating it with commercial solvers and applying it to real-world datasets.
Leveraging machine learning features for linear optical interferometer control
Kuzmin, Sergei S., Dyakonov, Ivan V., Straupe, Stanislav S.
We have developed an algorithm that constructs a model of a reconfigurable optical interferometer, independent of specific architectural constraints. The programming of unitary transformations on the interferometer's optical modes relies on either an analytical method for deriving the unitary matrix from a set of phase shifts or an optimization routine when such decomposition is not available. Our algorithm employs a supervised learning approach, aligning the interferometer model with a training set derived from the device being studied. A straightforward optimization procedure leverages this trained model to determine the phase shifts of the interferometer with a specific architecture, obtaining the required unitary transformation. This approach enables the effective tuning of interferometers without requiring a precise analytical solution, paving the way for the exploration of new interferometric circuit architectures.
Adaptive Deadline and Batch Layered Synchronized Federated Learning
Goren, Asaf, Lang, Natalie, Shlezinger, Nir, Cohen, Alejandro
Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner. However, synchronous FL suffers from latency bottlenecks due to device heterogeneity, where slower clients (stragglers) delay or degrade global updates. Prior solutions, such as fixed deadlines, client selection, and layer-wise partial aggregation, alleviate the effect of stragglers, but treat round timing and local workload as static parameters, limiting their effectiveness under strict time constraints. We propose ADEL-FL, a novel framework that jointly optimizes per-round deadlines and user-specific batch sizes for layer-wise aggregation. Our approach formulates a constrained optimization problem minimizing the expected L2 distance to the global optimum under total training time and global rounds. We provide a convergence analysis under exponential compute models and prove that ADEL-FL yields unbiased updates with bounded variance. Extensive experiments demonstrate that ADEL-FL outperforms alternative methods in both convergence rate and final accuracy under heterogeneous conditions.
Improved Approximations for Hard Graph Problems using Predictions
Aamand, Anders, Chen, Justin Y., Gollapudi, Siddharth, Silwal, Sandeep, Wu, Hao
We design improved approximation algorithms for NP-hard graph problems by incorporating predictions (e.g., learned from past data). Our prediction model builds upon and extends the $\varepsilon$-prediction framework by Cohen-Addad, d'Orsi, Gupta, Lee, and Panigrahi (NeurIPS 2024). We consider an edge-based version of this model, where each edge provides two bits of information, corresponding to predictions about whether each of its endpoints belong to an optimal solution. Even with weak predictions where each bit is only $\varepsilon$-correlated with the true solution, this information allows us to break approximation barriers in the standard setting. We develop algorithms with improved approximation ratios for MaxCut, Vertex Cover, Set Cover, and Maximum Independent Set problems (among others). Across these problems, our algorithms share a unifying theme, where we separately satisfy constraints related to high degree vertices (using predictions) and low-degree vertices (without using predictions) and carefully combine the answers.
TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks
Meng, Xiang, Makni, Mehdi, Mazumder, Rahul
Network pruning reduces the computational requirements of large neural networks, with N:M sparsity -- retaining only N out of every M consecutive weights -- offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved during matrix transposition, limiting efficiency during training where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4 which results in suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as optimal transport problems and solve through entropy regularization and Dykstra's algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100x speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks including Wanda, SparseGPT and ALPS to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms standard 2:4 sparse model, showing the practical value of our approach.
Review for NeurIPS paper: A Feasible Level Proximal Point Method for Nonconvex Sparse Constrained Optimization
Summary and Contributions: UPDATE AFTER REBUTTAL Thank you for your response. The authors have agreed to clarify the motivation for the use of non-convex constraints and will be precise about the use of the word "suboptimal". They have agreed to state that they have considerably more knowledge on the specific kind of non-convexity they are dealing with when comparing with prior works. As a consequence I have increased my score, although I believe the experimental section remains rather unconvincing. This paper proposes a method based on a sequence of convex approximations to solve optimization problems with a non-convex sparsity constraint.
Review for NeurIPS paper: Convex optimization based on global lower second-order models
Summary and Contributions: This paper presents new second-order algorithms for the prototypical composite convex optimization problem. First, the paper introduces a new global second-order lower approximation. Based on this lower approximation model, the paper introduces a new second-order optimization algorithm. The proposed method successively minimizes the lower approximation model of the smooth term augmented by the nonsmooth term and constructs the solution to the original problem as a convex combination of the solutions to these subproblems. In this regard, it can be seen as a second-order modification of the Frank-Wolfe method, which considers a second-order lower model instead of the first-order.
Gradient Methods with Online Scaling Part I. Theoretical Foundations
Gao, Wenzhi, Chu, Ya-Chi, Ye, Yinyu, Udell, Madeleine
This paper establishes the theoretical foundations of the online scaled gradient methods (OSGM), a framework that utilizes online learning to adapt stepsizes and provably accelerate first-order methods. OSGM quantifies the effectiveness of a stepsize by a feedback function motivated from a convergence measure and uses the feedback to adjust the stepsize through an online learning algorithm. Consequently, instantiations of OSGM achieve convergence rates that are asymptotically no worse than the optimal stepsize. OSGM yields desirable convergence guarantees on smooth convex problems, including 1) trajectory-dependent global convergence on smooth convex objectives; 2) an improved complexity result on smooth strongly convex problems, and 3) local superlinear convergence. Notably, OSGM constitutes a new family of first-order methods with non-asymptotic superlinear convergence, joining the celebrated quasi-Newton methods. Finally, OSGM explains the empirical success of the popular hypergradient-descent heuristic in optimization for machine learning.