Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent

Sun, Yihang, Wang, Huaijin, Hayden, Patrick, Blanchet, Jose

arXiv.org Machine Learning

The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over respective gradient descent baselines--stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.
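
The abstract gives no pseudocode, so here is a rough, hypothetical illustration of why conserving energy helps: a plain frictionless Hamiltonian (leapfrog) integrator on a positive one-dimensional double well. This is not the authors' ECD Hamiltonian, sECD noise model, or qECD simulation; the objective, step size, and initial momentum are illustrative choices only. Because the total energy p^2/2 + f(x) is (approximately) conserved, a trajectory started in the shallow well crosses the barrier and visits the global minimum, whereas gradient descent from the same point stays trapped.

def f(x):
    # Positive 1-D double well: global minimum near x = -1 (f ~ 0.1),
    # strict local minimum near x = +1 (f ~ 0.5), barrier of height ~ 1.3 near x = 0.
    return (x**2 - 1.0)**2 + 0.2 * x + 0.3

def grad_f(x):
    return 4.0 * x * (x**2 - 1.0) + 0.2

def energy_conserving_trajectory(x0, p0, dt=1e-2, steps=20000):
    # Leapfrog integration of H(x, p) = p^2/2 + f(x).  Energy is conserved up to O(dt^2),
    # so any barrier below the initial energy level can be traversed; we simply record
    # the lowest objective value seen along the trajectory.
    x, p = x0, p0
    best_x, best_f = x0, f(x0)
    for _ in range(steps):
        p -= 0.5 * dt * grad_f(x)   # half momentum kick
        x += dt * p                 # position drift
        p -= 0.5 * dt * grad_f(x)   # half momentum kick
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x, best_f

# Start near the strict local minimum (x = +1) with energy 0.5*1.5**2 + f(1) = 1.625 > barrier (~1.3).
print(energy_conserving_trajectory(x0=1.0, p0=1.5))   # lands near the global minimum x = -1

In the paper's sECD variant the dynamics additionally carry energy-preserving noise; the deterministic sketch above only conveys the barrier-crossing mechanism that conserved energy makes possible.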


Fair regression under localized demographic parity constraints

Charpentier, Arthur, Denis, Christophe, Elie, Romuald, Hebiri, Mohamed, HU, François

arXiv.org Machine Learning

Demographic parity (DP) is a widely used group fairness criterion requiring predictive distributions to be invariant across sensitive groups. While natural in classification, full distributional DP is often overly restrictive in regression and can lead to substantial accuracy loss. We propose a relaxation of DP tailored to regression, enforcing parity only at a finite set of quantile levels and/or score thresholds. Concretely, we introduce a novel $(\ell, Z)$-fair predictor, which imposes groupwise CDF constraints of the form $F_{f|S=s}(z_m) = \ell_m$ for prescribed pairs $(\ell_m, z_m)$. For this setting, we derive closed-form characterizations of the optimal fair discretized predictor via a Lagrangian dual formulation and quantify the discretization cost, showing that the risk gap to the continuous optimum vanishes as the grid is refined. We further develop a model-agnostic post-processing algorithm based on two samples (labeled for learning a base regressor and unlabeled for calibration), and establish finite-sample guarantees on constraint violation and excess penalized risk. In addition, we introduce two alternative frameworks where we match group and marginal CDF values at selected score thresholds. In both settings, we provide closed-form solutions for the optimal fair discretized predictor. Experiments on synthetic and real datasets illustrate an interpretable fairness-accuracy trade-off, enabling targeted corrections at decision-relevant quantiles or thresholds while preserving predictive performance.
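
To make the groupwise CDF constraints concrete, here is a minimal post-processing sketch in the spirit of the two-sample setup: on an unlabeled calibration sample it fits, per sensitive group, an increasing piecewise-linear map sending the group's $\ell_m$-quantile of the base scores to the threshold $z_m$, so that after the transform $P(\tilde f(X) \le z_m \mid S = s) \approx \ell_m$ for every group. This is a generic quantile-matching illustration, not the paper's Lagrangian-dual closed form; all function and variable names are hypothetical.

import numpy as np

def fit_groupwise_cdf_maps(scores, groups, z_grid, ell_grid):
    # For each group s, pair the ell_m-quantiles q_{s,m} of the base scores with the target
    # thresholds z_m; an increasing map with T_s(q_{s,m}) = z_m then gives
    # P(T_s(f(X)) <= z_m | S = s) ~= ell_m on the calibration sample.
    maps = {}
    for s in np.unique(groups):
        q = np.quantile(scores[groups == s], ell_grid)
        maps[s] = (np.asarray(q, dtype=float), np.asarray(z_grid, dtype=float))
    return maps

def apply_groupwise_cdf_maps(maps, scores, groups):
    out = np.empty_like(scores, dtype=float)
    for s, (q, z) in maps.items():
        m = groups == s
        x = scores[m]
        y = np.interp(x, q, z)                           # piecewise-linear between the knots
        y = np.where(x < q[0], z[0] - (q[0] - x), y)     # extend monotonically below the first knot
        y = np.where(x > q[-1], z[-1] + (x - q[-1]), y)  # ... and above the last knot
        out[m] = y
    return out

# Toy check: two groups whose base scores differ in location.
rng = np.random.default_rng(0)
groups = rng.integers(0, 2, size=4000)
scores = rng.normal(loc=0.5 * groups, scale=1.0)   # stand-in for a base regressor f
z_grid, ell_grid = [0.0, 1.0], [0.5, 0.8]          # require F_{f|S=s}(0.0) = 0.5 and F_{f|S=s}(1.0) = 0.8
maps = fit_groupwise_cdf_maps(scores, groups, z_grid, ell_grid)
adjusted = apply_groupwise_cdf_maps(maps, scores, groups)
for s in (0, 1):
    print(s, [round(float((adjusted[groups == s] <= z).mean()), 3) for z in z_grid])
# Both groups now sit at roughly (0.5, 0.8) at the two thresholds.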


How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Zeyuan Allen-Zhu

Neural Information Processing Systems

However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm at most $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\widetilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$ of [17].
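
To put the two rates side by side (suppressing constants and the logarithmic factors hidden in $\widetilde{O}$), a back-of-the-envelope comparison at target accuracy $\varepsilon = 10^{-2}$:

\varepsilon^{-2} = 10^{4} \qquad\text{versus}\qquad \varepsilon^{-8/3} = 10^{16/3} \approx 2.2 \times 10^{5},

roughly a 20-fold reduction in stochastic gradient computations at this accuracy; the ratio $\varepsilon^{-8/3}/\varepsilon^{-2} = \varepsilon^{-2/3}$ grows without bound as $\varepsilon \to 0$.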


f3d9de86462c28781cbe5c47ef22c3e5-Supplemental.pdf

Neural Information Processing Systems

The work [62] considers Algorithm 2 for the stochastic generalized linear bandit problem. Assume that $\theta$ is the true parameter of the reward model. Then we consider the lower bounds. For $f_j(A) = \langle \tfrac{1}{2}(e_{j_1}e_{j_2}^{\top} + e_{j_2}e_{j_1}^{\top}), A \rangle$ with $j_1 \neq j_2$, $f_j(A_i)$ is only 1 when $i = j$ and 0 otherwise. With Claim D.12 and Claim D.11 we get that g C q. To get 1), we write $V_l = [v_1, \ldots, v_l] \in \mathbb{R}^{d \times l}$ and $V_{-l} = [v_{l+1}, \ldots, v_k]$.


And for any $\alpha > 0$, the Laplacian satisfies $\Delta G_\lambda \leq c_1 2^{\alpha} G_\lambda + 2Md$, where $M = c_2 3^{\alpha}/c_1 + c_2^2$. 2. If $G(x) = \|g(x)\|^2$ is $C^1$, then $\|\nabla G(x)\| \leq c_2 \|g(x)\|$ and $g(x)^{\top} \nabla G(x) \geq c_1 G(x)$. Proof. Claim 1. Note $\nabla G_\lambda = 2\nabla^2 F_\lambda\, g_\lambda$ and that $\tfrac{1}{2} c_1 I \preceq \nabla^2 F_\lambda = \sum_{m}$
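
For context, the chain-rule step behind the stated identity, under the assumption (suggested by the fragment) that $g_\lambda = \nabla F_\lambda$:

G_\lambda(x) = \|g_\lambda(x)\|^2 \quad\Longrightarrow\quad \nabla G_\lambda(x) = 2\, J_{g_\lambda}(x)^{\top} g_\lambda(x) = 2\, \nabla^2 F_\lambda(x)\, g_\lambda(x),

since the Jacobian of $\nabla F_\lambda$ is the symmetric Hessian $\nabla^2 F_\lambda$.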

Neural Information Processing Systems

As a remark, using the Krylov-Bogoliubov existence theorem (see Corollary 11.8 of [6]), fixed points to (4) exist as long as one can show $\{\rho_t, t \geq 0\}$ is tight. The learning rate is set differently for each task. Obviously, the HV indicator (Eq. (10)) can also be used as an objective function for optimizing solution sets. For example, [25, 7] greedily add new points to obtain the highest expected HV improvement. However, the landscape of the HV indicator is piece-wise constant (similar to the 0-1 loss in classification) and is difficult to optimize with gradient descent. Particularly, for all the dominated points in the solution set, their gradient is zero.
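
As a small, hypothetical illustration of that last point (not taken from the paper): in two objectives, the hypervolume dominated with respect to a reference point is unchanged by small moves of a dominated point, so a finite-difference gradient at that point is exactly zero.

def hv_2d(points, ref):
    # Area dominated by `points` with respect to reference `ref` (both objectives minimized).
    area, y_floor = 0.0, ref[1]
    for x, y in sorted(points):          # sweep in ascending first objective
        if y < y_floor:                  # only non-dominated strips add new area
            area += (ref[0] - x) * (y_floor - y)
            y_floor = y
    return area

front = [(0.0, 1.0), (1.0, 0.0)]         # two mutually non-dominated points
dominated = (1.2, 1.3)                   # dominated by (0.0, 1.0)
ref = (2.0, 2.0)

base = hv_2d(front + [dominated], ref)
eps = 1e-3
grad = [(hv_2d(front + [(dominated[0] + dx, dominated[1] + dy)], ref) - base) / eps
        for dx, dy in ((eps, 0.0), (0.0, eps))]
print(base, grad)                        # 3.0 [0.0, 0.0]: the dominated point contributes no gradient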



210b7ec74fc9cec6fb8388dbbdaf23f7-Supplemental.pdf

Neural Information Processing Systems

Let $Z$ be the set of all indices $\ell \in [n]$ such that $\alpha[\ell] \neq 0$. Let $\bar{Z}$ be the set of all indices $\ell \in Z$ such that the $\ell$-th variable is constrained to be integral.