legendre transform
A ADDITIONAL PROOFS 470 A.1 Proof of Lemma 1 (strongly convex case)
The major part of the proof is adapted from Muzellec et al. [ 2021, Lemma 3.1]. In bold, the highest accuracy after being calibrated with the semi-dual. Quadratic Problem and can be numerically solved with CVXPY for instance. Ax b K, with K a fixed cone to be compiled only once. SSNB The strong convexity parameter l is chosen in { 0 .
Neural Implicit Solution Formula for Efficiently Solving Hamilton-Jacobi Equations
This paper presents an implicit solution formula for the Hamilton-Jacobi partial differential equation (HJ PDE). The formula is derived using the method of characteristics and is shown to coincide with the Hopf and Lax formulas in the case where either the Hamiltonian or the initial function is convex. It provides a simple and efficient numerical approach for computing the viscosity solution of HJ PDEs, bypassing the need for the Legendre transform of the Hamiltonian or the initial condition, and the explicit computation of individual characteristic trajectories. A deep learning-based methodology is proposed to learn this implicit solution formula, leveraging the mesh-free nature of deep learning to ensure scalability for high-dimensional problems. Building upon this framework, an algorithm is developed that approximates the characteristic curves piecewise linearly for state-dependent Hamiltonians. Extensive experimental results demonstrate that the proposed method delivers highly accurate solutions, even for nonconvex Hamiltonians, and exhibits remarkable scalability, achieving computational efficiency for problems up to 40 dimensions.
A note on continuous-time online learning
In online learning, the data is provided in a sequential order, and the goal of the learner is to make online decisions to minimize overall regrets. This note is concerned with continuous-time models and algorithms for several online learning problems: online linear optimization, adversarial bandit, and adversarial linear bandit. For each problem, we extend the discrete-time algorithm to the continuous-time setting and provide a concise proof of the optimal regret bound.
A Simple and General Duality Proof for Wasserstein Distributionally Robust Optimization
Zhang, Luhao, Yang, Jincheng, Gao, Rui
We present an elementary yet general proof of duality for Wasserstein distributionally robust optimization. The duality holds for any arbitrary Kantorovich transport cost, measurable loss function, and nominal probability distribution, provided that an interchangeability principle holds, which is equivalent to certain measurability conditions. To illustrate the broader applicability of our approach, we provide a rigorous treatment of duality results in distributionally robust Markov decision processes and distributionally robust multistage stochastic programming. Furthermore, we extend the result to other problems including infinity-Wasserstein distributionally robust optimization, risk-averse optimization, and globalized distributionally robust counterpart.
A max-affine spline approximation of neural networks using the Legendre transform of a convex-concave representation
Perrett, Adam, Wood, Danny, Brown, Gavin
This work presents a novel algorithm for transforming a neural network into a spline representation. Unlike previous work that required convex and piecewise-affine network operators to create a max-affine spline alternate form, this work relaxes this constraint. The only constraint is that the function be bounded and possess a well-define second derivative, although this was shown experimentally to not be strictly necessary. It can also be performed over the whole network rather than on each layer independently. As in previous work, this bridges the gap between neural networks and approximation theory but also enables the visualisation of network feature maps. Mathematical proof and experimental investigation of the technique is performed with approximation error and feature maps being extracted from a range of architectures, including convolutional neural networks.
Non-negative matrix and tensor factorisations with a smoothed Wasserstein loss
Non-negative matrix and tensor factorisations are a classical tool in machine learning and data science for finding low-dimensional representations of high-dimensional datasets. In applications such as imaging, datasets can often be regarded as distributions in a space with metric structure. In such a setting, a Wasserstein loss function based on optimal transportation theory is a natural choice since it incorporates knowledge about the geometry of the underlying space. We introduce a general mathematical framework for computing non-negative factorisations of matrices and tensors with respect to an optimal transport loss, and derive an efficient method for its solution using a convex dual formulation. We demonstrate the applicability of this approach with several numerical examples.
Towards a mathematical theory of trajectory inference
Lavenant, Hugo, Zhang, Stephen, Kim, Young-Heon, Schiebinger, Geoffrey
We devise a theoretical framework and a numerical method to infer trajectories of a stochastic process from snapshots of its temporal marginals. This problem arises in the analysis of single cell RNA-sequencing data, which provide high dimensional measurements of cell states but cannot track the trajectories of the cells over time. We prove that for a class of stochastic processes it is possible to recover the ground truth trajectories from limited samples of the temporal marginals at each time-point, and provide an efficient algorithm to do so in practice. The method we develop, Global Waddington-OT (gWOT), boils down to a smooth convex optimization problem posed globally over all time-points involving entropy-regularized optimal transport. We demonstrate that this problem can be solved efficiently in practice and yields good reconstructions, as we show on several synthetic and real datasets.
Additivity of Information in Multilayer Networks via Additive Gaussian Noise Transforms
Multilayer (or deep) networks are powerful probabilistic models based on multiple stages of a linear transform followed by a non-linear (possibly random) function. In general, the linear transforms are defined by matrices and the non-linear functions are defined by information channels. These models have gained great popularity due to their ability to characterize complex probabilistic relationships arising in a wide variety of inference problems. The contribution of this paper is a new method for analyzing the fundamental limits of statistical inference in settings where the model is known. The validity of our method can be established in a number of settings and is conjectured to hold more generally. A key assumption made throughout is that the matrices are drawn randomly from orthogonally invariant distributions. Our method yields explicit formulas for 1) the mutual information; 2) the minimum mean-squared error (MMSE); 3) the existence and locations of certain phase-transitions with respect to the problem parameters; and 4) the stationary points for the state evolution of approximate message passing algorithms. When applied to the special case of models with multivariate Gaussian channels our method is rigorous and has close connections to free probability theory for random matrices. When applied to the general case of non-Gaussian channels, our method provides a simple alternative to the replica method from statistical physics. A key observation is that the combined effects of the individual components in the model (namely the matrices and the channels) are additive when viewed in a certain transform domain.
The Concave-Convex Procedure (CCCP)
Yuille, Alan L., Rangarajan, Anand
This paper describes a simple geometrical Concave-Convex procedure (CCCP) for constructing discrete time dynamical systems which can be guaranteed to decrease almost any global optimization/energy function (see technical conditions in section (2)). We prove that there is a relationship between CCCP and optimization techniques based on introducing auxiliary variables using Legendre transforms. We distinguish between Legendre min-max and Legendre minimization. In the former, see [6], the introduction of auxiliary variables converts the problem to a min-max problem where the goal is to find a saddle point. By contrast, in Legendre minimization, see [8], the problem remains a minimization one (and so it becomes easier to analyze convergence).