Goto

Collaborating Authors

 assumption 3


Zeroth-Order Optimization Finds Flat Minima

Neural Information Processing Systems

Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known on the implicit regularization that provides a fine-grained characterization on which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions.


Just One Layer Norm Guarantees Stable Extrapolation

Neural Information Processing Systems

In spite of their prevalence, the behaviour of Neural Networks when extrapolating far from the training distribution remains poorly understood, with existing results limited to specific cases. In this work, we prove general results--the first of their kind--by applying Neural Tangent Kernel (NTK) theory to analyse infinitelywide neural networks trained until convergence and prove that the inclusion of just one Layer Norm (LN) fundamentally alters the induced NTK, transforming it into a bounded-variance kernel. As a result, the output of an infinitely wide network with at least one LN remains bounded, even on inputs far from the training data. In contrast, we show that a broad class of networks without LN can produce pathologically large outputs for certain inputs. We support these theoretical findings with empirical experiments on finite-width networks, demonstrating that while standard NNs often exhibit uncontrolled growth outside the training domain, a single LN layer effectively mitigates this instability. Finally, we explore real-world implications of this extrapolatory stability, including applications to predicting residue sizes in proteins larger than those seen during training and estimating age from facial images of underrepresented ethnicities absent from the training set.


Adaptive Riemannian ADMM for Nonsmooth Optimization: Optimal Complexity without Smoothing

Neural Information Processing Systems

We study the problem of minimizing the sum of a smooth function and a nonsmooth convex regularizer over a compact Riemannian submanifold embedded in Euclidean space. By introducing an auxiliary splitting variable, we propose an adaptive Riemannian alternating direction method of multipliers (ARADMM), which, for the first time, achieves convergence without requiring smoothing of the nonsmooth term. Our approach involves only one Riemannian gradient evaluation and one proximal update per iteration. Through careful and adaptive coordination of the stepsizes and penalty parameters, we establish an optimal iteration complexity of order O(ฯต 3) for finding an ฯต-approximate KKT point, matching the complexity of existing smoothing technique-based Riemannian ADMM methods. Extensive numerical experiments on sparse PCA and robust subspace recovery demonstrate that our ARADMM consistently outperforms state-of-the-art Riemannian ADMM variants in convergence speed and solution quality.


On Group Sufficiency Under Label Bias

Neural Information Processing Systems

Real-world classification datasets often contain label bias, where observed labels differ systematically from the true labels at different rates for different demographic groups. Machine learning models trained on such datasets may then exhibit disparities in predictive performance across these groups. In this work, we characterize the problem of learning fair classification models with respect to the underlying ground truth labels when given only label biased data. We focus on the particular fairness definition of group sufficiency, i.e. equal calibration of risk scores across protected groups. We theoretically show that enforcing fairness with respect to label biased data necessarily results in group miscalibration with respect to the true labels. We then propose a regularizer which minimizes an upper bound on the sufficiency gap by penalizing a conditional mutual information term. Across experiments on eight tabular, image, and text datasets with both synthetic and real label noise, we find that our method reduces the sufficiency gap by up to 7.2% with no significant decrease in overall accuracy.


Convergence Rates of Constrained Expected Improvement

Neural Information Processing Systems

Constrained Bayesian optimization (CBO) methods have seen significant success in black-box optimization with constraints. One of the most commonly used CBO methods is the constrained expected improvement (CEI) algorithm. CEI is a natural extension of expected improvement (EI) when constraints are incorporated. However, the theoretical convergence rate of CEI has not been established. In this work, we study the convergence rate of CEI by analyzing its simple regret upper bound.


Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational Inference

Neural Information Processing Systems

We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at a rate that is nearly independent of any explicit dimension dependence. Specifically, for a d-dimensional strongly log-concave and log-smooth target, the number of iterations for BBVI with a sub-Gaussian family to obtain a solution ฯต-close to the global optimum has an explicit dimension dependence no larger than O(logd). This is a significant improvement over the O(d)dependence of full-rank locationscale families. For heavy-tailed families, we prove a weaker O(d2/k)dependence, where kis the number of finite moments of the family. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.


Non-convex entropic mean-field optimization via Best Response flow

Neural Information Processing Systems

We study the problem of minimizing non-convex functionals on the space of probability measures, regularized by the relative entropy (KL divergence) with respect to a fixed reference measure, as well as the corresponding problem of solving entropy-regularized non-convex-non-concave min-max problems. We utilize the Best Response flow (also known in the literature as the fictitious play flow) and study how its convergence is influenced by the relation between the degree of non-convexity of the functional under consideration, the regularization parameter and the tail behaviour of the reference measure. In particular, we demonstrate how to choose the regularizer, given the non-convex functional, so that the Best Response operator becomes a contraction with respect to the L1Wasserstein distance, which ensures the existence of its unique fixed point that is then shown to be the unique global minimizer for our optimization problem. This extends recent results where the Best Response flow was applied to solve convex optimization problems regularized by the relative entropy with respect to arbitrary reference measures, and with arbitrary values of the regularization parameter. Our results explain precisely how the assumption of convexity can be relaxed, at the expense of making a specific choice of the regularizer. Additionally, we demonstrate how these results can be applied in reinforcement learning in the context of policy optimization for Markov Decision Processes and Markov games with softmax parametrized policies in the mean-field regime.


Adam Reduces a Unique Form of Sharpness: Theoretical Insights Near the Minimizer Manifold

Neural Information Processing Systems

Despite the popularity of the Adam optimizer in practice, most theoretical analyses study Stochastic Gradient Descent (SGD) as a proxy for Adam, and little is known about how the solutions found by Adam differ. In this paper, we show that Adam implicitly reduces a unique form of sharpness measure shaped by its adaptive updates, leading to qualitatively different solutions from SGD. More specifically, when the training loss is small, Adam wanders around the manifold of minimizers and takes semi-gradients to minimize this sharpness measure in an adaptive manner, a behavior we rigorously characterize through a continuous-time approximation using stochastic differential equations. We further demonstrate how this behavior differs from that of SGD in a well-studied setting: when training overparameterized models with label noise, SGD has been shown to minimize the trace of the Hessian matrix, tr(H), whereas we prove that Adam minimizes tr(Diag(H)1/2) instead. In solving sparse linear regression with diagonal linear networks, this distinction enables Adam to achieve better sparsity and generalization than SGD. Finally, our analysis framework extends beyond Adam to a broad class of adaptive gradient methods, including RMSProp, Adam-mini, Adalayer and Shampoo, and provides a unified perspective on how these adaptive optimizers reduce sharpness, which we hope will offer insights for future optimizer design.


Sample complexity of data-driven tuning of model hyperparameters in neural networks with structured parameter-dependent dual function

Neural Information Processing Systems

Modern machine learning algorithms, especially deep learning-based techniques, typically involve careful hyperparameter tuning to achieve the best performance. Despite the surge of intense interest in practical techniques like Bayesian optimization and random search-based approaches to automating this laborious and compute-intensive task, the fundamental learning-theoretic complexity of tuning hyperparameters for deep neural networks is poorly understood. Inspired by this glaring gap, we initiate the formal study of hyperparameter tuning complexity in deep learning through a recently introduced data-driven setting. We assume that we have a series of learning tasks, and we have to tune hyperparameters to do well on average over the distribution of tasks. A major difficulty is that the utility function as a function of the hyperparameter is very volatile and furthermore, it is given implicitly by an optimization problem over the model parameters. To tackle this challenge, we introduce a new technique to characterize the discontinuities and oscillations of the utility function on any fixed problem instance as we vary the hyperparameter; our analysis relies on subtle concepts including tools from algebraic geometry, differential geometry and constrained optimization. We use this to show that the learning-theoretic complexity of the corresponding family of utility functions is bounded. We instantiate our results and provide sample complexity bounds for concrete applications--tuning a hyperparameter that interpolates neural activation functions and setting the kernel parameter in graph neural networks.


From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling

Neural Information Processing Systems

We consider the problem of sampling distributions stemming from non-convex potentials with Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity.