Goto

Collaborating Authors

 Mathematical & Statistical Methods


A Stochastic Extra-Step Quasi-Newton Method for Nonsmooth Nonconvex Optimization

arXiv.org Machine Learning

In this paper, a novel stochastic extra-step quasi-Newton method is developed to solve a class of nonsmooth nonconvex composite optimization problems. We assume that the gradient of the smooth part of the objective function can only be approximated by stochastic oracles. The proposed method combines general stochastic higher order steps derived from an underlying proximal type fixed-point equation with additional stochastic proximal gradient steps to guarantee convergence. Based on suitable bounds on the step sizes, we establish global convergence to stationary points in expectation and an extension of the approach using variance reduction techniques is discussed. Motivated by large-scale and big data applications, we investigate a stochastic coordinate-type quasi-Newton scheme that allows to generate cheap and tractable stochastic higher order directions. Finally, the proposed algorithm is tested on large-scale logistic regression and deep learning problems and it is shown that it compares favorably with other state-of-the-art methods.


Implementation of a modified Nesterov's Accelerated quasi-Newton Method on Tensorflow

arXiv.org Machine Learning

The Nesterov's Accelerated Quasi-Newton (NAQ) method has shown to drastically improve the convergence speed compared to the conventional quasi-Newton method. This paper implements NAQ for non-convex optimization on T ensorflow. Two modifications have been proposed to the original NAQ algorithm to ensure global convergence and eliminate linesearch. The performance of the proposed algorithm - mNAQ is evaluated on standard non-convex function approximation benchmark problems and microwave circuit modelling problems. The results show that the improved algorithm converges better and faster compared to first order optimizers such as AdaGrad, RMSProp, Adam, and the second order methods such as the quasi-Newton method. Index T erms --Neural networks, training algorithm, quasi-Newton method, Nesterov's accelerated gradient, Global convergence, T ensorflow, highly-nonlinear function modeling.


Identification of Model Uncertainty via Optimal Design of Experiments applied to a Mechanical Press

arXiv.org Machine Learning

In engineering applications almost all processes are described with the aid of models. Especially forming machines heavily rely on mathematical models for control and condition monitoring. Inaccuracies during the modeling, manufacturing and assembly of these machines induce model uncertainty which impairs the controller's performance. In this paper we propose an approach to identify model uncertainty using parameter identification and optimal design of experiments. The experimental setup is characterized by optimal sensor positions such that specific model parameters can be determined with minimal variance. This allows for the computation of confidence regions, in which the real parameters or the parameter estimates from different test sets have to lie. We claim that inconsistencies in the estimated parameter values, considering their approximated confidence ellipsoids as well, cannot be explained by data or parameter uncertainty but are indicators of model uncertainty. The proposed method is demonstrated using a component of the 3D Servo Press, a multi-technology forming machine that combines spindles with eccentric servo drives.


A Stochastic Variance Reduced Nesterov's Accelerated Quasi-Newton Method

arXiv.org Machine Learning

--Recently algorithms incorporating second order curvature information have become popular in training neural networks. The Nesterov's Accelerated Quasi-Newton (NAQ) method has shown to effectively accelerate the BFGS quasi-Newton method by incorporating the momentum term and Nesterov's accelerated gradient vector . A stochastic version of NAQ method was proposed for training of large-scale problems. This paper proposes a stochastic variance reduced Nesterov's Accelerated Quasi-Newton method in full (SVR-NAQ) and limited (SVR-LNAQ) memory forms. The performance of the proposed method is evaluated in T ensorflow on four benchmark problems - two regression and two classification problems respectively. The results show improved performance compared to conventional methods.


Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization

arXiv.org Machine Learning

Within the context of empirical risk minimization, see Raginsky, Rakhlin, and Telgarsky (2017), we are concerned with a non-asymptotic analysis of sampling algorithms used in optimization. In particular, we obtain non-asymptotic error bounds for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). These results are derived in appropriate Wasserstein distances in the absence of the log-concavity of the target distribution. More precisely, the local Lipschitzness of the stochastic gradient $H(\theta, x)$ is assumed, and furthermore, the dissipativity and convexity at infinity condition are relaxed by removing the uniform dependence in $x$.


Basic Linear Algebra for Deep Learning

#artificialintelligence

The concepts of Linear Algebra are crucial for understanding the theory behind Machine Learning, especially for Deep Learning. They give you better intuition for how algorithms really work under the hood, which enables you to make better decisions. So if you really want to be a professional in this field, you cannot escape mastering some of its concepts. This post will give you an introduction to the most important concepts of Linear Algebra that are used in Machine Learning. Linear Algebra is a continuous form of mathematics and is applied throughout science and engineering because it allows you to model natural phenomena and to compute them efficiently.


Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond

arXiv.org Machine Learning

Several important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevent their usage in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed for such results which continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process. Our setup allows general dependencies of the stochastic process on the history of the latent process and the latent process to be influenced by realizations of the stochastic process. The results are thus applicable to adaptive modeling settings and also allows for sequential design of random vectors and matrices. We also discuss stochastic process based forms of J-L, RIP, and sketching, to illustrate the generality of the results.


One Sample Stochastic Frank-Wolfe

arXiv.org Machine Learning

One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its wide-spread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyper parameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.


On Dimension-free Tail Inequalities for Sums of Random Matrices and Applications

arXiv.org Machine Learning

In this paper, we present a new framework to obtain tail inequalities for sums of random matrices. Compared with existing works, our tail inequalities have the following characteristics: 1) high feasibility--they can be used to study the tail behavior of various matrix functions, e.g., arbitrary matrix norms, the absolute value of the sum of the sum of the $j$ largest singular values (resp. eigenvalues) of complex matrices (resp. Hermitian matrices); and 2) independence of matrix dimension --- they do not have the matrix-dimension term as a product factor, and thus are suitable to the scenario of high-dimensional or infinite-dimensional random matrices. The price we pay to obtain these advantages is that the convergence rate of the resulting inequalities will become slow when the number of summand random matrices is large. We also develop the tail inequalities for matrix random series and matrix martingale difference sequence. We also demonstrate usefulness of our tail bounds in several fields. In compressed sensing, we employ the resulted tail inequalities to achieve a proof of the restricted isometry property when the measurement matrix is the sum of random matrices without any assumption on the distributions of matrix entries. In probability theory, we derive a new upper bound to the supreme of stochastic processes. In machine learning, we prove new expectation bounds of sums of random matrices matrix and obtain matrix approximation schemes via random sampling. In quantum information, we show a new analysis relating to the fractional cover number of quantum hypergraphs. In theoretical computer science, we obtain randomness-efficient samplers using matrix expander graphs that can be efficiently implemented in time without dependence on matrix dimensions.


Change Detection in Noisy Dynamic Networks: A Spectral Embedding Approach

arXiv.org Machine Learning

Change detection in dynamic networks is an important problem in many areas, such as fraud detection, cyber intrusion detection and health care monitoring. It is a challenging problem because it involves a time sequence of graphs, each of which is usually very large and sparse with heterogeneous vertex degrees, resulting in a complex, high dimensional mathematical object. Spectral embedding methods provide an effective way to transform a graph to a lower dimensional latent Euclidean space that preserves the underlying structure of the network. Although change detection methods that use spectral embedding are available, they do not address sparsity and degree heterogeneity that usually occur in noisy real-world graphs and a majority of these methods focus on changes in the behaviour of the overall network. In this paper, we adapt previously developed techniques in spectral graph theory and propose a novel concept of applying Procrustes techniques to embedded points for vertices in a graph to detect changes in entity behaviour. Our spectral embedding approach not only addresses sparsity and degree heterogeneity issues, but also obtains an estimate of the appropriate embedding dimension. We call this method CDP (change detection using Procrustes analysis). We demonstrate the performance of CDP through extensive simulation experiments and a real-world application. CDP successfully detects various types of vertex-based changes including (i) changes in vertex degree, (ii) changes in community membership of vertices, and (iii) unusual increase or decrease in edge weight between vertices. The change detection performance of CDP is compared with two other baseline methods that employ alternative spectral embedding approaches. In both cases, CDP generally shows superior performance.