
Collaborating Authors

 Liang, Yingbin


Critical Points of Neural Networks: Analytical Forms and Landscape Properties

arXiv.org Machine Learning

Due to the success of deep learning in solving a variety of challenging machine learning tasks, there is rising interest in understanding the loss functions used to train neural networks from a theoretical perspective. In particular, the properties of critical points and the landscape around them are important in determining the convergence performance of optimization algorithms. In this paper, we provide a full (necessary and sufficient) characterization of the analytical forms of the critical points (as well as the global minimizers) of the square loss functions for various neural networks. We show that the analytical forms of the critical points characterize the values of the corresponding loss functions as well as the necessary and sufficient conditions to achieve the global minimum. Furthermore, we exploit the analytical forms of the critical points to characterize the landscape properties of the loss functions of these neural networks. One particular conclusion is that the loss function of linear networks has no spurious local minimum, while the loss function of one-hidden-layer nonlinear networks with the ReLU activation function does have local minima that are not global minima.
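For concreteness, the square loss studied in this line of work can be written, for a one-hidden-layer linear network with weight matrices $W_1, W_2$, input data $X$, and labels $Y$ (notation here is illustrative rather than the paper's exact symbols), as $L(W_1, W_2) = \frac{1}{2}\|W_2 W_1 X - Y\|_F^2$, and its critical points are exactly the pairs at which both gradients vanish: $\nabla_{W_2} L = (W_2 W_1 X - Y) X^\top W_1^\top = 0$ and $\nabla_{W_1} L = W_2^\top (W_2 W_1 X - Y) X^\top = 0$.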


Characterization of Gradient Dominance and Regularity Conditions for Neural Networks

arXiv.org Machine Learning

The past decade has witnessed a successful application of deep learning to solving many challenging problems in machine learning and artificial intelligence. However, the loss functions of deep neural networks (especially nonlinear networks) are still far from being well understood from a theoretical aspect. In this paper, we enrich the current understanding of the landscape of the square loss functions for three types of neural networks. Specifically, when the parameter matrices are square, we provide an explicit characterization of the global minimizers for linear networks, linear residual networks, and nonlinear networks with one hidden layer. Then, we establish two quadratic-type landscape properties for the square loss of these neural networks: the gradient dominance condition within a neighborhood of their full-rank global minimizers, and the regularity condition along certain directions and within a neighborhood of their global minimizers. These two landscape properties are desirable for optimization around the global minimizers of the loss function of these neural networks.
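In the generic form common in the nonconvex optimization literature (constants and symbols below are illustrative, not the paper's), gradient dominance around a global minimizer $w^*$ of a loss $f$ requires $\|\nabla f(w)\|_2^2 \ge 2\lambda\,(f(w) - f(w^*))$ for some $\lambda > 0$ and all $w$ in a neighborhood of $w^*$, while the regularity condition requires $\langle \nabla f(w),\, w - w^*\rangle \ge \frac{\mu}{2}\|w - w^*\|_2^2 + \frac{1}{2\lambda}\|\nabla f(w)\|_2^2$; under suitable step sizes, both properties are known to yield linear convergence of gradient descent within that neighborhood.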


Nonconvex Low-Rank Matrix Recovery with Arbitrary Outliers via Median-Truncated Gradient Descent

arXiv.org Machine Learning

Recent work has demonstrated the effectiveness of gradient descent for directly recovering the factors of low-rank matrices from random linear measurements in a globally convergent manner when initialized properly. However, the performance of existing algorithms is highly sensitive to outliers that may take arbitrary values. In this paper, we propose a truncated gradient descent algorithm to improve robustness against outliers, where the truncation adaptively rules out, in each iteration, the contributions of samples that deviate significantly from the sample median of the measurement residuals. We demonstrate that, when initialized in a basin of attraction close to the ground truth, the proposed algorithm converges to the ground truth at a linear rate for the Gaussian measurement model with a near-optimal number of measurements, even when a constant fraction of the measurements is arbitrarily corrupted. In addition, we propose a new truncated spectral method that ensures an initialization in the basin of attraction under slightly stronger requirements. We finally provide numerical experiments to validate the superior performance of the proposed approach.
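As an illustrative sketch (not the authors' released code), one iteration of median-truncated gradient descent for recovering a low-rank matrix $U V^\top$ from linear measurements might look as follows; the measurement model, constants, and function names here are assumptions for illustration.

```python
import numpy as np

def median_truncated_gd_step(U, V, A, y, step=0.5, c=3.0):
    """One illustrative iteration of median-truncated gradient descent for
    low-rank matrix recovery from linear measurements y_i = <A_i, U V^T> plus outliers.
    A has shape (m, d1, d2); U is (d1, r), V is (d2, r). The step size and the
    truncation constant c are placeholders, not the paper's tuned values."""
    X = U @ V.T                                   # current low-rank estimate
    resid = np.einsum('mij,ij->m', A, X) - y      # measurement residuals
    med = np.median(np.abs(resid))                # robust scale from the sample median
    keep = np.abs(resid) <= c * med               # drop samples far from the median
    m = len(y)
    # gradient of (1/2m) * sum of squared residuals, restricted to kept samples
    G = np.einsum('m,mij->ij', resid[keep], A[keep]) / m
    U_new = U - step * G @ V
    V_new = V - step * G.T @ U
    return U_new, V_new
```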


Median-Truncated Nonconvex Approach for Phase Retrieval with Outliers

arXiv.org Machine Learning

This paper investigates the phase retrieval problem, which aims to recover a signal from the magnitudes of its linear measurements. We develop statistically and computationally efficient algorithms for the situation when the measurements are corrupted by sparse outliers that can take arbitrary values. We propose a novel approach to robustify the gradient descent algorithm by using the sample median as a guide for pruning spurious samples in initialization and local search. Adopting the Poisson loss and the reshaped quadratic loss, respectively, we obtain two algorithms termed median-TWF and median-RWF, both of which provably recover the signal from a near-optimal number of measurements (up to a logarithmic factor) when the measurement vectors are composed of i.i.d. Gaussian entries, even when a constant fraction of the measurements is adversarially corrupted. We further show that both algorithms are stable in the presence of additional dense bounded noise. Our analysis is accomplished by developing non-trivial concentration results for median-related quantities, which may be of independent interest. We provide numerical experiments to demonstrate the effectiveness of our approach.
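Schematically, the median-guided pruning works as follows (symbols and the constant $\alpha$ are illustrative): at iterate $z$, a sample $i$ contributes to the gradient only if its residual is comparable to the sample median, e.g., $\big|\,y_i - |a_i^\top z|\,\big| \le \alpha \cdot \mathrm{med}_j\big\{|y_j - |a_j^\top z||\big\}$ for the reshaped quadratic loss, so that measurements corrupted by large outliers are excluded from both the initialization and each local search step.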


Reshaped Wirtinger Flow for Solving Quadratic System of Equations

Neural Information Processing Systems

We study the problem of recovering a vector $x \in \mathbb{R}^n$ from its magnitude measurements $y_i = |\langle a_i, x\rangle|$, $i = 1, \ldots, m$. Our work is along the line of the Wirtinger flow (WF) approach (Candès et al., 2015), which solves the problem by minimizing a nonconvex loss function via a gradient algorithm and can be shown to converge to a global optimal point under good initialization. In contrast to the smooth loss function used in WF, we adopt a nonsmooth but lower-order loss function, and design a gradient-like algorithm (referred to as reshaped-WF). We show that for random Gaussian measurements, reshaped-WF enjoys geometric convergence to a global optimal point as long as the number $m$ of measurements is on the order of $n$, the dimension of the unknown $x$. This improves the sample complexity of WF, and achieves the same sample complexity as truncated-WF (Chen and Candès, 2015) but without truncation at the gradient step. Furthermore, reshaped-WF costs less computationally than WF, and runs faster numerically than both WF and truncated-WF. By bypassing higher-order variables in the loss function and truncation in the gradient loop, the analysis of reshaped-WF is simplified.
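A minimal sketch of the reshaped-WF iteration described above, assuming real-valued Gaussian measurements and a placeholder step size (this is our illustration, not the authors' implementation):

```python
import numpy as np

def reshaped_wf(A, y, z0, step=0.8, n_iter=500):
    """Illustrative reshaped Wirtinger flow for real-valued phase retrieval:
    minimize (1/2m) * sum_i (|a_i^T z| - y_i)^2 with a gradient-like update.
    A is (m, n) with rows a_i, y holds the magnitude measurements, and z0 is
    an initialization (e.g., from a spectral method)."""
    m = len(y)
    z = z0.copy()
    for _ in range(n_iter):
        Az = A @ z
        # generalized gradient: (1/m) * sum_i (a_i^T z - y_i * sign(a_i^T z)) a_i
        grad = A.T @ (Az - y * np.sign(Az)) / m
        z = z - step * grad
    return z
```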


Nonparametric Detection of Anomalous Data Streams

arXiv.org Machine Learning

A nonparametric anomalous hypothesis testing problem is investigated, in which there are n sequences in total, among which s anomalous sequences are to be detected. Each typical sequence contains m independent and identically distributed (i.i.d.) samples drawn from a distribution p, whereas each anomalous sequence contains m i.i.d. samples drawn from a distribution q that is distinct from p. The distributions p and q are assumed to be unknown in advance. Distribution-free tests are constructed using the maximum mean discrepancy as the metric, which is based on mean embeddings of distributions into a reproducing kernel Hilbert space. The probability of error is bounded as a function of the sample size m, the number s of anomalous sequences, and the number n of sequences. It is then shown that with s known, the constructed test is exponentially consistent if m is greater than a constant factor of log n, for any p and q, whereas with s unknown, m should be of an order strictly greater than log n. Furthermore, it is shown that no test can be consistent for arbitrary p and q if m is less than a constant factor of log n; thus, the order-level optimality of the proposed test is established. Numerical results are provided to demonstrate that our tests outperform (or perform as well as) tests based on other competitive approaches in various cases.
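As a hedged illustration of the kind of distribution-free test described above (the kernel bandwidth, function names, and scoring rule are our assumptions), each sequence can be scored by its estimated MMD to the pooled remaining sequences, and with s known the s largest scores are declared anomalous:

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of the squared maximum mean discrepancy between
    samples X (m, d) and Y (n, d) using a Gaussian kernel with bandwidth sigma."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    m, n = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2 * Kxy.mean())

def score_sequences(seqs, sigma=1.0):
    """Score each sequence by its MMD^2 to the pooled remaining sequences;
    with s known, the s highest-scoring sequences are declared anomalous."""
    scores = []
    for i, X in enumerate(seqs):
        rest = np.vstack([s for j, s in enumerate(seqs) if j != i])
        scores.append(mmd2_unbiased(X, rest, sigma))
    return np.array(scores)
```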


Reshaped Wirtinger Flow and Incremental Algorithm for Solving Quadratic System of Equations

arXiv.org Machine Learning

We study the phase retrieval problem, which solves a quadratic system of equations, i.e., recovers a vector $\boldsymbol{x} \in \mathbb{R}^n$ from its magnitude measurements $y_i = |\langle \boldsymbol{a}_i, \boldsymbol{x}\rangle|$, $i = 1, \ldots, m$. We develop a gradient-like algorithm (referred to as RWF, for reshaped Wirtinger flow) by minimizing a nonconvex nonsmooth loss function. In comparison with the existing nonconvex Wirtinger flow (WF) algorithm (Candès et al., 2015), although the loss function becomes nonsmooth, it involves only the second power of the variable and hence reduces the complexity. We show that for random Gaussian measurements, RWF enjoys geometric convergence to a global optimal point as long as the number $m$ of measurements is on the order of $n$, the dimension of the unknown $\boldsymbol{x}$. This improves the sample complexity of WF, and achieves the same sample complexity as truncated Wirtinger flow (TWF) (Chen and Candès, 2015), but without truncation in the gradient loop. Furthermore, RWF costs less computationally than WF, and runs faster numerically than both WF and TWF. We further develop the incremental (stochastic) reshaped Wirtinger flow (IRWF) and show that IRWF converges linearly to the true signal. We also establish a performance guarantee for an existing Kaczmarz method for the phase retrieval problem based on its connection to IRWF. Finally, we empirically demonstrate that IRWF outperforms the existing ITWF algorithm (the stochastic version of TWF) as well as other batch algorithms.
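The incremental variant processes one randomly sampled measurement per update; a minimal sketch of such a step, under the same illustrative assumptions as the batch sketch above, is:

```python
import numpy as np

def irwf_step(z, a_i, y_i, step=1.0):
    """One illustrative incremental (stochastic) reshaped-WF update using a single
    randomly sampled measurement (a_i, y_i); the step size is a placeholder.
    Normalizing the update by ||a_i||^2 gives a Kaczmarz-style iteration."""
    s = a_i @ z
    return z - step * (s - y_i * np.sign(s)) * a_i
```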


Nonparametric Detection of Geometric Structures over Networks

arXiv.org Machine Learning

Nonparametric detection of the existence of an anomalous structure over a network is investigated. Nodes corresponding to the anomalous structure (if one exists) receive samples generated by a distribution q, which is different from the distribution p generating samples for the other nodes. If an anomalous structure does not exist, all nodes receive samples generated by p. It is assumed that the distributions p and q are arbitrary and unknown. The goal is to design statistically consistent tests whose probability of error converges to zero as the network size becomes asymptotically large. Kernel-based tests are proposed based on the maximum mean discrepancy, which measures the distance between mean embeddings of distributions into a reproducing kernel Hilbert space. Detection of an anomalous interval over a line network is studied first; see the schematic test after this abstract. Sufficient conditions on the minimum and maximum sizes of candidate anomalous intervals are characterized in order to guarantee that the proposed test is consistent. It is also shown that certain necessary conditions must hold for any test to be universally consistent. Comparison of the sufficient and necessary conditions shows that the proposed test is order-level optimal and nearly optimal, respectively, in terms of the minimum and maximum sizes of candidate anomalous intervals. Generalization of the results to other networks is further developed. Numerical results are provided to demonstrate the performance of the proposed tests.
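Schematically (the threshold and notation are illustrative), the interval test scans candidate intervals $I$ in a class $\mathcal{I}$ of the line network and declares an anomaly if $\max_{I \in \mathcal{I}} \widehat{\mathrm{MMD}}^2\big(\{x_v\}_{v \in I},\, \{x_v\}_{v \notin I}\big) > \tau$, i.e., if the samples on some interval are sufficiently far, in the kernel mean-embedding distance, from the samples outside that interval.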


Analysis of Robust PCA via Local Incoherence

Neural Information Processing Systems

We investigate the robust PCA problem of decomposing an observed matrix into the sum of a low-rank matrix and a sparse error matrix via the convex program Principal Component Pursuit (PCP). In contrast to previous studies that assume the support of the error matrix is generated by uniform Bernoulli sampling, we allow non-uniform sampling, i.e., entries of the low-rank matrix are corrupted by errors with unequal probabilities. We characterize conditions on the error corruption of each individual entry, based on the local incoherence of the low-rank matrix, under which correct matrix decomposition by PCP is guaranteed. Such a refined analysis of robust PCA captures how robustly each entry of the low-rank matrix resists error corruption. In order to deal with non-uniform error corruption, our technical proof introduces a new weighted norm and develops/exploits the concentration properties that this norm satisfies.
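For reference, the PCP program referred to above is the convex optimization $\min_{L, S}\ \|L\|_* + \lambda \|S\|_1$ subject to $L + S = M$, where $\|\cdot\|_*$ is the nuclear norm, $\|\cdot\|_1$ is the entrywise $\ell_1$ norm, $M$ is the observed matrix, and $\lambda > 0$ is a regularization parameter (its choice under local incoherence is part of the paper's analysis).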


Sharp Threshold for Multivariate Multi-Response Linear Regression via Block Regularized Lasso

arXiv.org Machine Learning

In this paper, we investigate a multivariate multi-response (MVMR) linear regression problem, which contains multiple linear regression models with differently distributed design matrices, and different regression and output vectors. The goal is to recover the support union of all regression vectors using $l_1/l_2$-regularized Lasso. We characterize sufficient and necessary conditions on sample complexity, as a sharp threshold, to guarantee successful recovery of the support union. Namely, if the sample size is above the threshold, then $l_1/l_2$-regularized Lasso correctly recovers the support union; and if the sample size is below the threshold, $l_1/l_2$-regularized Lasso fails to recover the support union. In particular, the threshold precisely captures the impact of the sparsity of regression vectors and the statistical properties of the design matrices on sample complexity. Therefore, the threshold function also captures the advantages of joint support union recovery using multi-task Lasso over individual support recovery using single-task Lasso.
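In a standard form of this problem (notation here is generic, not the paper's exact symbols), the $l_1/l_2$-regularized Lasso solves $\min_{\beta^{(1)}, \ldots, \beta^{(K)}} \sum_{k=1}^{K} \frac{1}{2 n_k}\|y^{(k)} - X^{(k)}\beta^{(k)}\|_2^2 + \lambda \sum_{j=1}^{p} \big\|\big(\beta_j^{(1)}, \ldots, \beta_j^{(K)}\big)\big\|_2$, so that the inner $l_2$ norm couples the $j$-th coefficient across all $K$ regression tasks while the outer sum acts like an $l_1$ penalty promoting a shared support union.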