Goto

Collaborating Authors

 Mathematical & Statistical Methods


A mean-field theory of lazy training in two-layer neural nets: entropic regularization and controlled McKean-Vlasov dynamics

arXiv.org Machine Learning

We consider the problem of universal approximation of functions by two-layer neural nets with random weights that are "nearly Gaussian" in the sense of Kullback-Leibler divergence. This problem is motivated by recent works on lazy training, where the weight updates generated by stochastic gradient descent do not move appreciably from the i.i.d. Gaussian initialization. We first consider the mean-field limit, where the finite population of neurons in the hidden layer is replaced by a continual ensemble, and show that our problem can be phrased as global minimization of a free-energy functional on the space of probability measures over the weights. This functional trades off the $L^2$ approximation risk against the KL divergence with respect to a centered Gaussian prior. We characterize the unique global minimizer and then construct a controlled nonlinear dynamics in the space of probability measures over weights that solves a McKean--Vlasov optimal control problem. This control problem is closely related to the Schr\"odinger bridge (or entropic optimal transport) problem, and its value is proportional to the minimum of the free energy. Finally, we show that SGD in the lazy training regime (which can be ensured by jointly tuning the variance of the Gaussian prior and the entropic regularization parameter) serves as a greedy approximation to the optimal McKean--Vlasov distributional dynamics and provide quantitative guarantees on the $L^2$ approximation error.



SA vs SAA for population Wasserstein barycenter calculation

arXiv.org Machine Learning

In Machine Learning and Optimization community there are two main approaches for convex risk minimization problem. The first approach is Stochastic Averaging (SA) (online) and the second one is Stochastic Average Approximation (SAA) (Monte Carlo, Empirical Risk Minimization, offline) with proper regularization in non-strongly convex case. At the moment, it is known that both approaches are on average equivalent (up to a logarithmic factor) in terms of oracle complexity (required number of stochastic gradient evaluations). What is the situation with total complexity? The answer depends on specific problem. However, starting from work [Nemirovski et al. (2009)] it was generally accepted that SA is better than SAA. Nevertheless, in case of large-scale problems SA may ran out of memory problems since storing all data on one machine and organizing online access to it can be impossible without communications with other machines. SAA in contradistinction to SA allows parallel/distributed calculations. In this paper we show that SAA may outperform SA in the problem of calculating an estimation for population ({\mu}-entropy regularized) Wasserstein barycenter even for non-parallel (non-decenralized) set up.


Deep combinatorial optimisation for optimal stopping time problems and stochastic impulse control. Application to swing options pricing and fixed transaction costs options hedging

arXiv.org Machine Learning

American-style options are used not only by traditional asset managers but also by energy companies to hedge "optimised assets" by finding optimal decisions to optimise their P&L and find their value. A common modelling of a power plant unit P&L is done using swing options which are American options allowing to exercise at most l times the option with possibly a constraint on the delay between two exercise dates (see Carmona and Touzi (2008) or Warin (2012) for gas storage modelling). Formally, for T 0, we are given a stochastic processes ( X t) t 0 defined on a probability space (โ„ฆ, F, F ( F t) t 0, P) and one wants to find an increasing sequence of F stopping times ฯ„ ( ฯ„ 1,ฯ„ 2,...,ฯ„ l) that maximises the expectation of some objective function f E Pnull l null i 1f ( ฯ„ i,X ฯ„ i) 1 ฯ„ i Tnull . Numerical methods to solve the optimal stopping problem when l 1,f ( x,t) e rt g (x) and X is Markovian include: - Dynamic programming equation: the option price P 0 is computed using the following backward discrete scheme over a grid t 0 0 t 1 ... t N T: P t N g ( X T), P t i max( g ( X t i),e r (t i 1 t i) E P( P t i 1 F t i)), i 0,...,N 1 . One then needs to perform regression to compute the conditional expectations, see Longstaff and Schwartz (2001) or Bouchard and Warin (2012).


Transport Gaussian Processes for Regression

arXiv.org Machine Learning

Gaussian process (GP) priors are non-parametric generative models with appealing modelling properties for Bayesian inference: they can model non-linear relationships through noisy observations, have closed-form expressions for training and inference, and are governed by interpretable hyperparameters. However, GP models rely on Gaussianity, an assumption that does not hold in several real-world scenarios, e.g., when observations are bounded or have extreme-value dependencies, a natural phenomenon in physics, finance and social sciences. Although beyond-Gaussian stochastic processes have caught the attention of the GP community, a principled definition and rigorous treatment is still lacking. In this regard, we propose a methodology to construct stochastic processes, which include GPs, warped GPs, Student-t processes and several others under a single unified approach. We also provide formulas and algorithms for training and inference of the proposed models in the regression problem. Our approach is inspired by layers-based models, where each proposed layer changes a specific property over the generated stochastic process. That, in turn, allows us to push-forward a standard Gaussian white noise prior towards other more expressive stochastic processes, for which marginals and copulas need not be Gaussian, while retaining the appealing properties of GPs. We validate the proposed model through experiments with real-world data.


Fast Graph Metric Learning via Gershgorin Disc Alignment

arXiv.org Machine Learning

We propose a fast general projection-free metric learning framework, where the minimization objective $\min_{\M \in \cS} Q(\M)$ is a convex differentiable function of the metric matrix $\M$, and $\M$ resides in the set $\cS$ of generalized graph Laplacian matrices for connected graphs with positive edge weights and node degrees. Unlike low-rank metric matrices common in the literature, $\cS$ includes the important positive-diagonal-only matrices as a special case in the limit. The key idea for fast optimization is to rewrite the positive definite cone constraint in $\cS$ as signal-adaptive linear constraints via Gershgorin disc alignment, so that the alternating optimization of the diagonal and off-diagonal terms in $\M$ can be solved efficiently as linear programs via Frank-Wolfe iterations. We prove that the Gershgorin discs can be aligned perfectly using the first eigenvector $\v$ of $\M$, which we update iteratively using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as diagonal / off-diagonal terms are optimized. Experiments show that our efficiently computed graph metric matrices outperform metrics learned using competing methods in terms of classification tasks.


Variational Optimization on Lie Groups, with Examples of Leading (Generalized) Eigenvalue Problems

arXiv.org Machine Learning

The article considers smooth optimization of functions on Lie groups. By generalizing NAG variational principle in vector space (Wibisono et al., 2016) to Lie groups, continuous Lie-NAG dynamics which are guaranteed to converge to local optimum are obtained. They correspond to momentum versions of gradient flow on Lie groups. A particular case of $\mathsf{SO}(n)$ is then studied in details, with objective functions corresponding to leading Generalized EigenValue problems: the Lie-NAG dynamics are first made explicit in coordinates, and then discretized in structure preserving fashions, resulting in optimization algorithms with faithful energy behavior (due to conformal symplecticity) and exactly remaining on the Lie group. Stochastic gradient versions are also investigated. Numerical experiments on both synthetic data and practical problem (LDA for MNIST) demonstrate the effectiveness of the proposed methods as optimization algorithms ($not$ as a classification method).


Data Science Crash Course 3/10: Linear Algebra and Statistics

#artificialintelligence

This is the third instalment of Data Science Crash Course and today we're going to review mathematics needed for Data Science. Linear algebra is all about manipulations with vectors and matrices. It's both notation and useful way of manipulating object. You can perform operations on vectors like adding by adding each respective term -- they need to have the same length. You can multiply a vector by a scalar, that is a real number, by multiplying each of the entries by this real number.


Artificial Benchmark for Community Detection (ABCD): Fast Random Graph Model with Community Structure

arXiv.org Machine Learning

Most of the current complex networks that are of interest to practitioners possess a certain community structure that plays an important role in understanding the properties of these networks. Moreover, many machine learning algorithms and tools that are developed for complex networks try to take advantage of the existence of communities to improve their performance or speed. As a result, there are many competing algorithms for detecting communities in large networks. Unfortunately, these algorithms are often quite sensitive and so they cannot be fine-tuned for a given, but a constantly changing, real-world network at hand. It is therefore important to test these algorithms for various scenarios that can only be done using synthetic graphs that have built-in community structure, power-law degree distribution, and other typical properties observed in complex networks. The standard and extensively used method for generating artificial networks is the LFR graph generator. Unfortunately, this model has some scalability limitations and it is challenging to analyze it theoretically. Finally, the mixing parameter $\mu$, the main parameter of the model guiding the strength of the communities, has a non-obvious interpretation and so can lead to unnaturally-defined networks. In this paper, we provide an alternative random graph model with community structure and power-law distribution for both degrees and community sizes, the Artificial Benchmark for Community Detection (ABCD). We show that the new model solves the three issues identified above and more. The conclusion is that these models produce comparable graphs but ABCD is fast, simple, and can be easily tuned to allow the user to make a smooth transition between the two extremes: pure (independent) communities and random graph with no community structure.


Efficient Programmable Random Variate Generation Accelerator from Sensor Noise

arXiv.org Machine Learning

--We introduce a method for nonuniform random number generation based on sampling a physical process in a controlled environment. We demonstrate one proof-of-concept implementation of the method that reduces the error of Monte Carlo integration of a univariate Gaussian by 1068 while doubling the speed of the Monte Carlo simulation. We show that the supply voltage and temperature of the physical process must be controlled to prevent the mean and standard deviation of the random number generator from drifting. URRENT software-based methods of nonuniform random variate generation are slow and inefficient [1][2][3][4]. We present a programmable system capable of generating Gaussian random variates by extracting the noise properties of a MEMS sensor.