Goto

Collaborating Authors

 Gradient Descent


A Variance-Reduced Stochastic Gradient Tracking Algorithm for Decentralized Optimization with Orthogonality Constraints

arXiv.org Artificial Intelligence

Decentralized optimization with orthogonality constraints is found widely in scientific computing and data science. Since the orthogonality constraints are nonconvex, it is quite challenging to design efficient algorithms. Existing approaches leverage the geometric tools from Riemannian optimization to solve this problem at the cost of high sample and communication complexities. To relieve this difficulty, based on two novel techniques that can waive the orthogonality constraints, we propose a variance-reduced stochastic gradient tracking (VRSGT) algorithm with the convergence rate of $O(1 / k)$ to a stationary point. To the best of our knowledge, VRSGT is the first algorithm for decentralized optimization with orthogonality constraints that reduces both sampling and communication complexities simultaneously. In the numerical experiments, VRSGT has a promising performance in a real-world autonomous driving application.


HAT4RD: Hierarchical Adversarial Training for Rumor Detection on Social Media

arXiv.org Artificial Intelligence

With the development of social media, social communication has changed. While this facilitates people's communication and access to information, it also provides an ideal platform for spreading rumors. In normal or critical situations, rumors will affect people's judgment and even endanger social security. However, natural language is high-dimensional and sparse, and the same rumor may be expressed in hundreds of ways on social media. As such, the robustness and generalization of the current rumor detection model are put into question. We proposed a novel \textbf{h}ierarchical \textbf{a}dversarial \textbf{t}raining method for \textbf{r}umor \textbf{d}etection (HAT4RD) on social media. Specifically, HAT4RD is based on gradient ascent by adding adversarial perturbations to the embedding layers of post-level and event-level modules to deceive the detector. At the same time, the detector uses stochastic gradient descent to minimize the adversarial risk to learn a more robust model. In this way, the post-level and event-level sample spaces are enhanced, and we have verified the robustness of our model under a variety of adversarial attacks. Moreover, visual experiments indicate that the proposed model drifts into an area with a flat loss landscape, leading to better generalization. We evaluate our proposed method on three public rumors datasets from two commonly used social platforms (Twitter and Weibo). Experiment results demonstrate that our model achieves better results than state-of-the-art methods.


Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization

arXiv.org Artificial Intelligence

We study stochastic decentralized optimization for the problem of training machine learning models with large-scale distributed data. We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires the time of $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach precision $\epsilon$, which are the best complexities among the non-accelerated gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic condition number and batch condition number for strongly convex and smooth problems, respectively, $\kappa_c$ is the condition number of the communication network, and $n$ is the sample size on each distributed node. The proposed VR-DIGing has a little higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient computation complexities are the same as the ones of single-machine VR methods, such as SAG, SAGA, and SVRG, and our communication complexities keep the same as those of EXTRA and DIGing, respectively. To further speed up the convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$ communication complexity. Our stochastic gradient computation complexity is also the same as the ones of single-machine accelerated VR methods, such as Katyusha, and our communication complexity keeps the same as those of accelerated full batch decentralized methods, such as MSDA.


SYNTHESIS: A Semi-Asynchronous Path-Integrated Stochastic Gradient Method for Distributed Learning in Computing Clusters

arXiv.org Artificial Intelligence

To increase the training speed of distributed learning, recent years have witnessed a significant amount of interest in developing both synchronous and asynchronous distributed stochastic variance-reduced optimization methods. However, all existing synchronous and asynchronous distributed training algorithms suffer from various limitations in either convergence speed or implementation complexity. This motivates us to propose an algorithm called STNTHESIS (semi-asynchronous path-integrated stochastic gradient search), which leverages the special structure of the variance-reduction framework to overcome the limitations of both synchronous and asynchronous distributed learning algorithms while retaining their salient features. We consider two implementations of STNTHESIS under distributed and shared memory architectures. We show that our STNTHESIS algorithms have $O(\sqrt{N}\epsilon^{-2}(\Delta+1)+N)$ and $O(\sqrt{N}\epsilon^{-2}(\Delta+1) d+N)$ computational complexities for achieving an $\epsilon$-stationary point in non-convex learning under distributed and shared memory architectures, respectively, where N denotes the total number of training samples and $\Delta$ represents the maximum delay of the workers. Moreover, we investigate the generalization performance of \algname by establishing algorithmic stability bounds for quadratic strongly convex and non-convex optimization. We further conduct extensive numerical experiments to verify our theoretical findings


Node-Level Differentially Private Graph Neural Networks

arXiv.org Artificial Intelligence

Graph Neural Networks (GNNs) are a popular technique for modelling graph-structured data and computing node-level representations via aggregation of information from the neighborhood of each node. However, this aggregation implies an increased risk of revealing sensitive information, as a node can participate in the inference for multiple nodes. This implies that standard privacy-preserving machine learning techniques, such as differentially private stochastic gradient descent (DP-SGD) - which are designed for situations where each data point participates in the inference for one point only - either do not apply, or lead to inaccurate models. In this work, we formally define the problem of learning GNN parameters with node-level privacy, and provide an algorithmic solution with a strong differential privacy guarantee. We employ a careful sensitivity analysis and provide a non-trivial extension of the privacy-by-amplification technique to the GNN setting. An empirical evaluation on standard benchmark datasets demonstrates that our method is indeed able to learn accurate privacy-preserving GNNs which outperform both private and non-private methods that completely ignore graph information.


A simplified convergence theory for Byzantine resilient stochastic gradient descent

arXiv.org Artificial Intelligence

In distributed learning, a central server trains a model according to updates provided by nodes holding local data samples. In the presence of one or more malicious servers sending incorrect information (a Byzantine adversary), standard algorithms for model training such as stochastic gradient descent (SGD) fail to converge. In this paper, we present a simplified convergence theory for the generic Byzantine Resilient SGD method originally proposed by Blanchard et al. [NeurIPS 2017]. Compared to the existing analysis, we shown convergence to a stationary point in expectation under standard assumptions on the (possibly nonconvex) objective function and flexible assumptions on the stochastic gradients.


Grasp'D: Differentiable Contact-rich Grasp Synthesis for Multi-fingered Hands

arXiv.org Artificial Intelligence

The study of hand-object interaction requires generating viable grasp poses for high-dimensional multi-finger models, often relying on analytic grasp synthesis which tends to produce brittle and unnatural results. This paper presents Grasp'D, an approach for grasp synthesis with a differentiable contact simulation from both known models as well as visual inputs. We use gradient-based methods as an alternative to sampling-based grasp synthesis, which fails without simplifying assumptions, such as pre-specified contact locations and eigengrasps. Such assumptions limit grasp discovery and, in particular, exclude high-contact power grasps. In contrast, our simulation-based approach allows for stable, efficient, physically realistic, high-contact grasp synthesis, even for gripper morphologies with high-degrees of freedom. We identify and address challenges in making grasp simulation amenable to gradient-based optimization, such as non-smooth object surface geometry, contact sparsity, and a rugged optimization landscape. Grasp'D compares favorably to analytic grasp synthesis on human and robotic hand models, and resultant grasps achieve over 4x denser contact, leading to significantly higher grasp stability. Video and code available at https://graspd-eccv22.github.io/.


A Stochastic Variance Reduced Gradient using Barzilai-Borwein Techniques as Second Order Information

arXiv.org Artificial Intelligence

In this paper, we consider to improve the stochastic variance reduce gradient (SVRG) method via incorporating the curvature information of the objective function. We propose to reduce the variance of stochastic gradients using the computationally efficient Barzilai-Borwein (BB) method by incorporating it into the SVRG. We also incorporate a BB-step size as its variant. We prove its linear convergence theorem that works not only for the proposed method but also for the other existing variants of SVRG with second-order information. We conduct the numerical experiments on the benchmark datasets and show that the proposed method with constant step size performs better than the existing variance reduced methods for some test problems.


Critical Bach Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer using Hyperparameters Close to One

arXiv.org Artificial Intelligence

Practical results have shown that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions. We first show theoretical evidence that the momentum method (Momentum) and adaptive moment estimation (Adam) perform well in the sense that the upper bound of the theoretical performance measure is small with a small constant learning rate, hyperparameters close to one, and a large batch size. Next, we show that there exists a batch size called the critical batch size minimizing the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, and that SFO complexity increases once the batch size exceeds the critical batch size. Finally, we provide numerical results that support our theoretical results. That is, the numerical results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity has faster convergence than Momentum and stochastic gradient descent (SGD).


Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

arXiv.org Artificial Intelligence

We propose and analyze several stochastic gradient algorithms for finding stationary points or local minimum in nonconvex, possibly with nonsmooth regularizer, finite-sum and online optimization problems. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction called ProxSVRG+. We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, hence solves an open problem proposed in Reddi et al. (2016b). Also, ProxSVRG+ uses much less proximal oracle calls than ProxSVRG (Reddi et al., 2016b) and extends to the online setting by avoiding full gradient computations. Then, we further propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al., 2017) and show that SSRGD further improves the gradient complexity of ProxSVRG+ and achieves the optimal upper bound, matching the known lower bound of (Fang et al., 2018; Li et al., 2021). Moreover, we show that both ProxSVRG+ and SSRGD enjoy automatic adaptation with local structure of the objective function such as the Polyak-\L{}ojasiewicz (PL) condition for nonconvex functions in the finite-sum case, i.e., we prove that both of them can automatically switch to faster global linear convergence without any restart performed in prior work ProxSVRG (Reddi et al., 2016b). Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum instead of just finding an $\epsilon$-approximate (first-order) stationary point (which may be some bad unstable saddle points). We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding some random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates.