lipschitz continuity
Adversarial Generalization of Unfolding (Model-based) Networks
Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with l2-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-ofthe-art overaparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.
AComputationally Viable Numerical Gradient-based Technique for Optimal Covering Problems
The problem of optimally covering a given compact subset of RN with a preassigned number n of Euclidean metric balls has a long-standing history and it is well-recognized to be computationally hard. This article establishes a numerically viable algorithm for obtaining optimal covers of compact sets via two key contributions. The first is a foundational result establishing Lipschitz continuity of the marginal function of a certain parametric non-convex maximization problem in the optimal covering problem, and it provides the substrate for numerical gradient algorithms to be employed in this context. The second is an adaptation of a stochastically smoothed numerical gradient-based (zeroth-order) algorithm for a non-convex minimization problem, that, equipped with randomized restarts, spurs global convergence to an optimal cover. Several numerical experiments with complicated nonconvex compact sets demonstrate the excellent performance of our techniques.
Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity
Dölz, Jürgen, Multerer, Michael, Palma, Michele
Robustness of neural networks is commonly quantified via local or global Lipschitz constants. However, Lipschitz continuity can be overly coarse or overly restrictive as global robustness measure, failing to capture nuanced, data-dependent behavior. We propose a data-driven, architecture-agnostic framework based on the discrete modulus of continuity (DMOC), a non linear generalization of Lipschitz continuity that provides a finer notion of robustness. Unlike many existing approaches, DMOC does not require access to model internals and instead evaluates regularity relative to the data distribution. This shifts the focus from the model to the data, which provide a data-driven baseline of regularity against which the network's robustness is assessed. We establish convergence results for DMOC-induced seminorms with explicit data-driven rates in terms of the separation distance, and introduce a scalable minibatch algorithm that reduces the quadratic cost of exact computation, enabling application to large-scale data sets such as ImageNet. Empirically, DMOC serves as an architecture independent diagnostic: it distinguishes trained from untrained networks, reveals underfitting and overfitting regimes, and yields, as a special case, tight Lipschitz estimates comparable to state-of-the-art method such as ECLipsE and ECLipsE-fast.
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games
Zheng, Zihao, King, Irwin, Lu, Songtao
Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
Uniform Scaling Limits in AdamW-Trained Transformers
Gibson, William, Reisinger, Christoph
We study the large-depth limit of transformers trained with AdamW, by modelling the hidden-state dynamics as an interacting particle system (IPS) coupled through the attention mechanism. Under appropriate scaling of the attention heads, we prove that the joint dynamics of the hidden states and backpropagated variables converge in $L^2$, uniformly over the initial condition, to the solution of a forward--backward system of ODEs at rate $\mathcal O(L^{-1}+L^{-1/3}H^{-1/2})$. Here, $L$ and $H$ denote the depth and number of heads of the transformer, respectively. The limiting system of ODEs can be identified with a McKean--Vlasov ODE (MVODE) when the attention heads do not incorporate causal masking. By using the flow maps associated with this MVODE and applying concentration of measure techniques, we obtain bounds on the difference between the discrete and continuous models that are uniform over compact sets of initial conditions. As this is achieved without resorting to a covering argument, the constants in our bounds are independent of the number of tokens. Furthermore, under a suitable adaptation to AdamW, the bounds become independent of the token embedding dimension.
Ratio-based Loss Functions
Helgerth, Lena, Christmann, Andreas
Algorithms in machine learning and AI do critically depend on at least three key components: (i) the risk function, which is the expectation of the loss function, (ii) the function space, which is often called the hypothesis space, and (iii) the set of probability measures, which are allowed for the specified algorithm. This paper gives a survey of a certain class of loss functions, which we call ratio-based. In supervised learning, margin-based loss functions for classification tasks depending on the product of the output values $y_i$ and the predictions $f(x_i)$ as well as distance-based loss functions depending on the difference of $y_i$ and $f(x_i)$ for regression are common. Distance-based loss functions are in particular useful, if an additive model assumption seems plausible, i.e. the common signal plus noise assumption. However, in the literature, several loss functions proposed for regression purposes have a multiplicative error structure in mind and pay attention to relative errors, i.e. to the ratio of $y_i$ and $f(x_i)$. In this survey article, we systematically investigate such ratio-based loss functions and propose a few new losses, which may be interesting for future research. We concentrate on investigating general properties of ratio-based loss functions like continuity, Lipschitz-continuity, convexity, and differentiability, because these properties play a central role in most machine learning algorithms. Therefore, we do not focus on some specific machine learning algorithm to derive universal consistency, learning rates, or stability results. Instead, we want to enable future research in this direction.
Fractal Landscapes in Policy Optimization
Policy gradient lies at the core of deep reinforcement learning (RL) in continuous domains. Despite much success, it is often observed in practice that RL training with policy gradient can fail for many reasons, even on standard control problems with known solutions. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal for certain classes of MDPs, such that there does not exist gradient to be estimated in the first place. We draw on techniques from chaos theory and non-smooth analysis, and analyze the maximal Lyapunov exponents and Hölder exponents of the policy optimization objectives. Moreover, we develop a practical method that can estimate the local smoothness of objective function from samples to identify when the training process has encountered fractal landscapes. We show experiments to illustrate how some failure cases of policy optimization can be explained by such fractal landscapes.