Goto

Collaborating Authors

 Fang, Cong


SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator

Neural Information Processing Systems

In this paper, we propose a new technique named \textit{Stochastic Path-Integrated Differential EstimatoR} (SPIDER), which can be used to track many deterministic quantities of interests with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose SPIDER-SFO that solve non-convex stochastic optimization problems using stochastic gradients only. We provide a few error-bound results on its convergence rates. Specially, we prove that the SPIDER-SFO algorithm achieves a gradient computation cost of $\mathcal{O}\left( \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3} ) \right)$ to find an $\epsilon$-approximate first-order stationary point. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finite-sum setting. Our SPIDER technique can be further applied to find an $(\epsilon, \mathcal{O}(\ep^{0.5}))$-approximate second-order stationary point at a gradient computation cost of $\tilde{\mathcal{O}}\left( \min( n^{1/2} \epsilon^{-2}+\epsilon^{-2.5}, \epsilon^{-3} ) \right)$.


SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator

Neural Information Processing Systems

In this paper, we propose a new technique named \textit{Stochastic Path-Integrated Differential EstimatoR} (SPIDER), which can be used to track many deterministic quantities of interests with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose SPIDER-SFO that solve non-convex stochastic optimization problems using stochastic gradients only. We provide a few error-bound results on its convergence rates. Specially, we prove that the SPIDER-SFO algorithm achieves a gradient computation cost of $\mathcal{O}\left( \min( n^{1/2} \epsilon^{-2}, \epsilon^{-3} ) \right)$ to find an $\epsilon$-approximate first-order stationary point. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finite-sum setting. Our SPIDER technique can be further applied to find an $(\epsilon, \mathcal{O}(\ep^{0.5}))$-approximate second-order stationary point at a gradient computation cost of $\tilde{\mathcal{O}}\left( \min( n^{1/2} \epsilon^{-2}+\epsilon^{-2.5}, \epsilon^{-3} ) \right)$.


Hessian-Aware Zeroth-Order Optimization for Black-Box Adversarial Attack

arXiv.org Machine Learning

Zeroth-order optimization or derivative-free optimization is an important research topic in machine learning. In recent, it has become a key tool in black-box adversarial attack to neural network based image classifiers. However, existing zeroth-order optimization algorithms rarely extract Hessian information of the model function. In this paper, we utilize the second-order information of the objective function and propose a novel \emph{Hessian-aware zeroth-order algorithm} called \texttt{ZO-HessAware}. Our theoretical result shows that \texttt{ZO-HessAware} has an improved zeroth-order convergence rate and query complexity under structured Hessian approximation, where we propose a few approximation methods of such. Our empirical studies on the black-box adversarial attack problem validate that our algorithm can achieve improved success rates with a lower query complexity.


Lifted Proximal Operator Machines

arXiv.org Artificial Intelligence

We propose a new optimization method for training feed-forward neural networks. By rewriting the activation function as an equivalent proximal operator, we approximate a feed-forward neural network by adding the proximal operators to the objective function as penalties, hence we call the lifted proximal operator machine (LPOM). LPOM is block multi-convex in all layer-wise weights and activations. This allows us to use block coordinate descent to update the layer-wise weights and activations in parallel. Most notably, we only use the mapping of the activation function itself, rather than its derivatives, thus avoiding the gradient vanishing or blow-up issues in gradient based training methods. So our method is applicable to various non-decreasing Lipschitz continuous activation functions, which can be saturating and non-differentiable. LPOM does not require more auxiliary variables than the layer-wise activations, thus using roughly the same amount of memory as stochastic gradient descent (SGD) does. We further prove the convergence of updating the layer-wise weights and activations. Experiments on MNIST and CIFAR-10 datasets testify to the advantages of LPOM.


SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path Integrated Differential Estimator

arXiv.org Machine Learning

In this paper, we propose a new technique named Stochastic Path-Integrated Differential EstimatoR (SPIDER), which can be used to track many deterministic quantities of interest with significantly reduced computational cost. Combining SPIDER with the method of normalized gradient descent, we propose two new algorithms, namely SPIDER-SFO and SPIDER-SSO, that solve non-convex stochastic optimization problems using stochastic gradients only. We provide sharp error-bound results on their convergence rates. Specially, we prove that the SPIDER-SFO and SPIDER-SSO algorithms achieve a record-breaking $\tilde{O}(\epsilon^{-3})$ gradient computation cost to find an $\epsilon$-approximate first-order and $(\epsilon, O(\epsilon^{0.5}))$-approximate second-order stationary point, respectively. In addition, we prove that SPIDER-SFO nearly matches the algorithmic lower bound for finding stationary point under the gradient Lipschitz assumption in the finite-sum setting.


Faster and Non-ergodic O(1/K) Stochastic Alternating Direction Method of Multipliers

Neural Information Processing Systems

We study stochastic convex optimization subjected to linear equality constraints. Traditional Stochastic Alternating Direction Method of Multipliers and its Nesterov's acceleration scheme can only achieve ergodic O(1/\sqrt{K}) convergence rates, where K is the number of iteration. By introducing Variance Reduction (VR) techniques, the convergence rates improve to ergodic O(1/K). In this paper, we propose a new stochastic ADMM which elaborately integrates Nesterov's extrapolation and VR techniques. With Nesterov’s extrapolation, our algorithm can achieve a non-ergodic O(1/K) convergence rate which is optimal for separable linearly constrained non-smooth convex problems, while the convergence rates of VR based ADMM methods are actually tight O(1/\sqrt{K}) in non-ergodic sense. To the best of our knowledge, this is the first work that achieves a truly accelerated, stochastic convergence rate for constrained convex problems. The experimental results demonstrate that our algorithm is significantly faster than the existing state-of-the-art stochastic ADMM methods.


Parallel Asynchronous Stochastic Variance Reduction for Nonconvex Optimization

AAAI Conferences

Nowadays, asynchronous parallel algorithms have received much attention in the optimization field due to the crucial demands for modern large-scale optimization problems. However, most asynchronous algorithms focus on convex problems. Analysis on nonconvex problems is lacking. For the Asynchronous Stochastic Descent (ASGD) algorithm, the best result from (Lian et al., 2015) can only achieve an asymptotic O(\frac{1}{\epsilon^2}) rate (convergence to the stationary points) on nonconvex problems. In this paper, we study Stochastic Variance Reduced Gradient (SVRG) in the asynchronous setting. We propose the Asynchronous Stochastic Variance Reduced Gradient (ASVRG) algorithm for nonconvex finite-sum problems. We develop two schemes for ASVRG, depending on whether the parameters are updated as an atom or not. We prove that both of the two schemes can achieve linear speed up (a non-asymptotic O(\frac{n^\frac{2}{3}}{\epsilon}) rate to the stationary points) for nonconvex problems when the delay parameter \tau\leq n^{\frac{1}{3}}, where n is the number of training samples. We also establish a non-asymptotic O(\frac{n^\frac{2}{3}\tau^\frac{1}{3}}{\epsilon}) rate (convergence to the stationary points) for our algorithm without assumptions on \tau. This further demonstrates that even with asynchronous updating, SVRG has less number of Incremental First-order Oracles (IFOs) compared with Stochastic Gradient Descent and Gradient Descent. We also experiment on a shared memory multi-core system to demonstrate the efficiency of our algorithm.