Optimization
Off-Policy Interval Estimation with Lipschitz Value Iteration
Reinforcement learning (RL) (e.g., Sutton & Barto, 1998) has become widely used in tasks like Li, 2016; Liu et al., 2018a), estimating the expected reward of a target policy using observational data gathered from previous behavior policies, therefore holds tremendous promise for designing Our method is efficient and provably convergent. Our work is closely related to the off-policy point estimation.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Summary: The authors re-explain regularization in optimization problems as a constraint of the type the parameters ${\bf w}$ must belong to the convex set $O$ where the convex set O is obtained as the convex hull of all the points of the form $g.v$ where $v$ is some fix vector, $g$ an element from a group and $.$ is a (linear) group action of element $g$ on vector $v$. More concretely, their main contributions are as follows. For example, the ball associated to the L1 norm can be explained as the convex hull of the points obtained by flipping the sign and permuting the components of the vector $(1,0,0,..,0)$; (B) they show that given a seed $v$ and a group action associated to a group $G$, the notion of $w$ is a member of the convex set $O_G(v)$ can be seen as $v$ is smaller than $w$ under a pre-order; (C) they show that if $-v$ belongs to convex set $O$ then $O$ can be seen as the ball of an atomic norm (as defined in Chandra et al.); (D) they show that the L1-sorted norm equals the dual of the norm associated to the signed-pertumation orbitope; (E) they show how to reinterpret the main steps of conditional and projected gradient algorithms in the language of orbitopes and give a procedure to compute projections onto orbitopes. Quality: There are no technical mistakes in the paper.
Drop-Muon: Update Less, Converge Faster
Gruntkowska, Kaja, Maziane, Yassine, Qu, Zheng, Richtรกrik, Peter
Conventional wisdom in deep learning optimization dictates updating all layers at every step-a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce a non-Euclidean Randomized Progressive Training method-Drop-Muon-a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings, marking the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
Flatness-Aware Stochastic Gradient Langevin Dynamics
Bruno, Stefano, Hwang, Youngsik, An, Jaehyeon, Sabanis, Sotirios, Lim, Dong-Young
Generalization in deep learning is closely tied to the pursuit of flat minima in the loss landscape, yet classical Stochastic Gradient Langevin Dynamics (SGLD) offers no mechanism to bias its dynamics toward such low-curvature solutions. This work introduces Flatness-Aware Stochastic Gradient Langevin Dynamics (fSGLD), designed to efficiently and provably seek flat minima in high-dimensional nonconvex optimization problems. At each iteration, fSGLD uses the stochastic gradient evaluated at parameters perturbed by isotropic Gaussian noise, commonly referred to as Random Weight Perturbation (RWP), thereby optimizing a randomized-smoothing objective that implicitly captures curvature information. Leveraging these properties, we prove that the invariant measure of fSGLD stays close to a stationary measure concentrated on the global minimizers of a loss function regularized by the Hessian trace whenever the inverse temperature and the scale of random weight perturbation are properly coupled. This result provides a rigorous theoretical explanation for the benefits of random weight perturbation. In particular, we establish non-asymptotic convergence guarantees in Wasserstein distance with the best known rate and derive an excess-risk bound for the Hessian-trace regularized objective. Extensive experiments on noisy-label and large-scale vision tasks, in both training-from-scratch and fine-tuning settings, demonstrate that fSGLD achieves superior or comparable generalization and robustness to baseline algorithms while maintaining the computational cost of SGD, about half that of SAM. Hessian-spectrum analysis further confirms that fSGLD converges to significantly flatter minima.
Lower Bounds on Adversarial Robustness for Multiclass Classification with General Loss Functions
Trillos, Camilo Andrรฉs Garcรญa, Trillos, Nicolรกs Garcรญa
We consider adversarially robust classification in a multiclass setting under arbitrary loss functions and derive dual and barycentric reformulations of the corresponding learner-agnostic robust risk minimization problem. We provide explicit characterizations for important cases such as the cross-entropy loss, loss functions with a power form, and the quadratic loss, extending in this way available results for the 0-1 loss. These reformulations enable efficient computation of sharp lower bounds for adversarial risks and facilitate the design of robust classifiers beyond the 0-1 loss setting. Our paper uncovers interesting connections between adversarial robustness, $ฮฑ$-fair packing problems, and generalized barycenter problems for arbitrary positive measures where Kullback-Leibler and Tsallis entropies are used as penalties. Our theoretical results are accompanied with illustrative numerical experiments where we obtain tighter lower bounds for adversarial risks with the cross-entropy loss function.
Deep Hedging Under Non-Convexity: Limitations and a Case for AlphaZero
Maggiolo, Matteo, Nuti, Giuseppe, ล trupl, Miroslav, Szehr, Oleg
This paper examines replication portfolio construction in incomplete markets - a key problem in financial engineering with applications in pricing, hedging, balance sheet management, and energy storage planning. We model this as a two-player game between an investor and the market, where the investor makes strategic bets on future states while the market reveals outcomes. Inspired by the success of Monte Carlo Tree Search in stochastic games, we introduce an AlphaZero-based system and compare its performance to deep hedging - a widely used industry method based on gradient descent. Through theoretical analysis and experiments, we show that deep hedging struggles in environments where the $Q$-function is not subject to convexity constraints - such as those involving non-convex transaction costs, capital constraints, or regulatory limitations - converging to local optima. We construct specific market environments to highlight these limitations and demonstrate that AlphaZero consistently finds near-optimal replication strategies. On the theoretical side, we establish a connection between deep hedging and convex optimization, suggesting that its effectiveness is contingent on convexity assumptions. Our experiments further suggest that AlphaZero is more sample-efficient - an important advantage in data-scarce, overfitting-prone derivative markets.
A reproducible comparative study of categorical kernels for Gaussian process regression, with new clustering-based nested kernels
Perez, Raphaรซl Carpintero, Da Veiga, Sรฉbastien, Garnier, Josselin
Designing categorical kernels is a major challenge for Gaussian process regression with continuous and categorical inputs. Despite previous studies, it is difficult to identify a preferred method, either because the evaluation metrics, the optimization procedure, or the datasets change depending on the study. In particular, reproducible code is rarely available. The aim of this paper is to provide a reproducible comparative study of all existing categorical kernels on many of the test cases investigated so far. We also propose new evaluation metrics inspired by the optimization community, which provide quantitative rankings of the methods across several tasks. From our results on datasets which exhibit a group structure on the levels of categorical inputs, it appears that nested kernels methods clearly outperform all competitors. When the group structure is unknown or when there is no prior knowledge of such a structure, we propose a new clustering-based strategy using target encodings of categorical variables. We show that on a large panel of datasets, which do not necessarily have a known group structure, this estimation strategy still outperforms other approaches while maintaining low computational cost.
Enhancing Electricity-System Resilience with Adaptive Robust Optimization and Conformal Uncertainty Characterization
Chen, Shuyi, Zhu, Shixiang, Sioshansi, Ramteen
Extreme weather is straining electricity systems, exposing the limitations of reactive responses, and prompting the need for proactive resilience planning. Most existing approaches to enhance electricity system resilience employ simplified uncertainty models and decouple proactive and reactive decisions. This paper proposes a novel tri-level optimization model that integrates proactive actions, adversarial disruptions, and reactive responses. Conformal prediction is used to construct distribution-free system-disruption uncertainty sets with coverage guarantees. The tri-level problem is solved by using duality theory to derive a bi-level reformulation and employing Bender's decomposition. Numerical experiments demonstrate that our approach outperforms conventional robust and two-stage methods.