AITopics

2009.10619

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > Singapore > Central Region > Singapore (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (1.00)

Industry: Retail (1.00)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Machine LearningSep-22-2020

FedCluster: Boosting the Convergence of Federated Learning via Cluster-Cycling

Chen, Cheng, Chen, Ziyi, Zhou, Yi, Kailkhura, Bhavya

We develop FedCluster--a novel federated learning framework with improved optimization efficiency, and investigate its theoretical convergence properties. The FedCluster groups the devices into multiple clusters that perform federated learning cyclically in each learning round. Therefore, each learning round of FedCluster consists of multiple cycles of meta-update that boost the overall convergence. In nonconvex optimization, we show that FedCluster with the devices implementing the local {stochastic gradient descent (SGD)} algorithm achieves a faster convergence rate than the conventional {federated averaging (FedAvg)} algorithm in the presence of device-level data heterogeneity. We conduct experiments on deep learning applications and demonstrate that FedCluster converges significantly faster than the conventional federated learning under diverse levels of device-level data heterogeneity for a variety of local optimizers.

artificial intelligence, fedcluster, machine learning, (18 more...)

2009.10748

Country:

North America > United States > Utah (0.05)
North America > United States > California > San Francisco County > San Francisco (0.04)

Genre: Research Report (0.40)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

arXiv.org Artificial IntelligenceSep-21-2020

Scalable Adversarial Attack on Graph Neural Networks with Alternating Direction Method of Multipliers

Feng, Boyuan, Wang, Yuke, Li, Xu, Ding, Yufei

Graph neural networks (GNNs) have achieved high performance in analyzing graph-structured data and have been widely deployed in safety-critical areas, such as finance and autonomous driving. However, only a few works have explored GNNs' robustness to adversarial attacks, and their designs are usually limited by the scale of input datasets (i.e., focusing on small graphs with only thousands of nodes). In this work, we propose, SAG, the first scalable adversarial attack method with Alternating Direction Method of Multipliers (ADMM). We first decouple the large-scale graph into several smaller graph partitions and cast the original problem into several subproblems. Then, we propose to solve these subproblems using projected gradient descent on both the graph topology and the node features that lead to considerably lower memory consumption compared to the conventional attack methods. Rigorous experiments further demonstrate that SAG can significantly reduce the computation and memory overhead compared with the state-of-the-art approach, making SAG applicable towards graphs with large size of nodes and edges.

artificial intelligence, machine learning, sag, (18 more...)

arXiv.org Artificial Intelligence

2009.10233

Country:

North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre:

Research Report (1.00)
Overview > Innovation (0.34)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Military (0.92)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Machine LearningSep-20-2020

Stochastic Gradient Langevin Dynamics Algorithms with Adaptive Drifts

Kim, Sehwan, Song, Qifan, Liang, Faming

Bayesian deep learning offers a principled way to address many issues concerning safety of artificial intelligence (AI), such as model uncertainty,model interpretability, and prediction bias. However, due to the lack of efficient Monte Carlo algorithms for sampling from the posterior of deep neural networks (DNNs), Bayesian deep learning has not yet powered our AI system. We propose a class of adaptive stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithms, where the drift function is biased to enhance escape from saddle points and the bias is adaptively adjusted according to the gradient of past samples. We establish the convergence of the proposed algorithms under mild conditions, and demonstrate via numerical examples that the proposed algorithms can significantly outperform the existing SGMCMC algorithms, such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian Monte Carlo (SGHMC) and preconditioned SGLD, in both simulation and optimization tasks.

algorithm, artificial intelligence, machine learning, (18 more...)

2009.09535

Country:

North America > United States > South Carolina (0.04)
North America > United States > Indiana > Tippecanoe County > West Lafayette (0.04)
North America > United States > Indiana > Tippecanoe County > Lafayette (0.04)
(2 more...)

Genre: Research Report (0.40)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

arXiv.org Machine LearningSep-18-2020

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

Zheng, Shuai, Lin, Haibin, Zha, Sheng, Li, Mu

BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amount of data, which result in long training time and impede development progress. Using stochastic gradient methods with large mini-batch has been advocated as an efficient tool to reduce the training time. Along this line of research, LAMB is a prominent example that reduces the training time of BERT from 3 days to 76 minutes on a TPUv3 Pod. In this paper, we propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training. As the learning rate is theoretically upper bounded by the inverse of the Lipschitz constant of the function, one cannot always reduce the number of optimization iterations by selecting a larger learning rate. In order to use larger mini-batch size without accuracy loss, we develop a new learning rate scheduler that overcomes the difficulty of using large learning rate. Using the proposed LANS method and the learning rate scheme, we scaled up the mini-batch sizes to 96K and 33K in phases 1 and 2 of BERT pretraining, respectively. It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.

artificial intelligence, machine learning, optimization problem, (16 more...)

2006.13484

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.50)

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Zhou, Pan, Yuan, Xiaotong

Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization

Stochastic variance-reduced gradient (SVRG) algorithms have been shown to work favorably in solving large-scale learning problems. Despite the remarkable success, the stochastic gradient complexity of SVRG-type algorithms usually scales linearly with data size and thus could still be expensive for huge data. To address this deficiency, we propose a hybrid stochastic-deterministic minibatch proximal gradient (HSDMPG) algorithm for strongly-convex problems that enjoys provably improved data-size-independent complexity guarantees. More precisely, for quadratic loss $F(\theta)$ of $n$ components, we prove that HSDMPG can attain an $\epsilon$-optimization-error $\mathbb{E}[F(\theta)-F(\theta^*)]\leq\epsilon$ within $\mathcal{O}\Big(\frac{\kappa^{1.5}\epsilon^{0.75}\log^{1.5}(\frac{1}{\epsilon})+1}{\epsilon}\wedge\Big(\kappa \sqrt{n}\log^{1.5}\big(\frac{1}{\epsilon}\big)+n\log\big(\frac{1}{\epsilon}\big)\Big)\Big)$ stochastic gradient evaluations, where $\kappa$ is condition number. For generic strongly convex loss functions, we prove a nearly identical complexity bound though at the cost of slightly increased logarithmic factors. For large-scale learning problems, our complexity bounds are superior to those of the prior state-of-the-art SVRG algorithms with or without dependence on data size. Particularly, in the case of $\epsilon=\mathcal{O}\big(1/\sqrt{n}\big)$ which is at the order of intrinsic excess error bound of a learning model and thus sufficient for generalization, the stochastic gradient complexity bounds of HSDMPG for quadratic and generic loss functions are respectively $\mathcal{O} (n^{0.875}\log^{1.5}(n))$ and $\mathcal{O} (n^{0.875}\log^{2.25}(n))$, which to our best knowledge, for the first time achieve optimal generalization in less than a single pass over data. Extensive numerical results demonstrate the computational advantages of our algorithm over the prior ones.

artificial intelligence, complexity, machine learning, (13 more...)

2009.09835

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.48)

Industry: Education > Focused Education > Special Education (0.44)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)

Radhakrishnan, Adityanarayanan, Belkin, Mikhail, Uhler, Caroline

Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors

The following questions are fundamental to understanding the properties of over-parameterization in modern machine learning: (1) Under what conditions and at what rate does training converge to a global minimum? (2) What form of implicit regularization occurs through training? While significant progress has been made in answering both of these questions for gradient descent, they have yet to be answered more completely for general optimization methods. In this work, we establish sufficient conditions for linear convergence and obtain approximate implicit regularization results for generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror. GMD subsumes popular first order optimization methods including gradient descent, mirror descent, and preconditioned gradient descent methods such as Adagrad. By using the Polyak-Lojasiewicz inequality, we first present a simple analysis under which non-stochastic GMD converges linearly to a global minimum. We then present a novel, Taylor-series based analysis to establish sufficient conditions for linear convergence of stochastic GMD. As a corollary, our result establishes sufficient conditions and provides learning rates for linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain approximate implicit regularization results for GMD by proving that GMD converges to an interpolating solution that is approximately the closest interpolating solution to the initialization in l2-norm in the dual space, thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full batch setting.

artificial intelligence, convergence, machine learning, (15 more...)

2009.08574

Country:

North America > United States > Massachusetts (0.04)
North America > United States > California > San Diego County > San Diego (0.04)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data

Peng, Jie, Wu, Zhaoxian, Ling, Qing

We propose a Byzantine-robust variance-reduced stochastic gradient descent (SGD) method to solve the distributed finite-sum minimization problem when the data on the workers are not independent and identically distributed (i.i.d.). During the learning process, an unknown number of Byzantine workers may send malicious messages to the master node, leading to remarkable learning error. Most of the Byzantine-robust methods address this issue by using robust aggregation rules to aggregate the received messages, but rely on the assumption that all the regular workers have i.i.d. data, which is not the case in many federated learning applications. In light of the significance of reducing stochastic gradient noise for mitigating the effect of Byzantine attacks, we use a resampling strategy to reduce the impact of both inner variation (that describes the sample heterogeneity on every regular worker) and outer variation (that describes the sample heterogeneity among the regular workers), along with a stochastic average gradient algorithm (SAGA) to fully eliminate the inner variation. The variance-reduced messages are then aggregated with a robust geometric median operator. Under certain conditions, we prove that the proposed method reaches a neighborhood of the optimal solution with linear convergence rate, and the learning error is much smaller than those given by the state-of-the-art methods in the non-i.i.d. setting. Numerical experiments corroborate the theoretical results and show satisfactory performance of the proposed method.

artificial intelligence, machine learning, variation, (15 more...)

2009.08161

Country: Asia > China > Guangdong Province > Guangzhou (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.57)

Thorbjarnarson, Tómas, Yorke-Smith, Neil

On Training Neural Networks with Mixed Integer Programming

Recent work has shown potential in using Mixed Integer Programming (MIP) solvers to optimize certain aspects of neural networks (NN). However little research has gone into training NNs with solvers. State of the art methods to train NNs are typically gradient-based and require significant data, computation on GPUs and extensive hyper-parameter tuning. In contrast, training with MIP solvers should not require GPUs or hyper-parameter tuning but can likely not handle large amounts of data. This work builds on recent advances that train binarized NNs using MIP solvers. We go beyond current work by formulating new MIP models to increase the amount of data that can be used and to train non-binary integer-valued networks. Our results show that comparable results to using gradient descent can be achieved when minimal data is available.

artificial intelligence, machine learning, solver, (18 more...)

2009.03825

Country: Europe > Netherlands > South Holland > Delft (0.05)

Genre:

Research Report > New Finding (0.54)
Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Arachie, Chidubem, Huang, Bert

Constrained Labeling for Weakly Supervised Learning

arXiv.org Artificial IntelligenceSep-15-2020

Curation of large fully supervised datasets has become one of the major roadblocks for machine learning. Weak supervision provides an alternative to supervised learning by training with cheap, noisy, and possibly correlated labeling functions from varying sources. The key challenge in weakly supervised learning is combining the different weak supervision signals while navigating misleading correlations in their errors. In this paper, we propose a simple data-free approach for combining weak supervision signals by defining a constrained space for the possible labels of the weak signals and training with a random labeling within this constrained space. Our method is efficient and stable, converging after a few iterations of gradient descent. We prove theoretical conditions under which the worst-case error of the randomized label decreases with the rank of the linear constraints. We show experimentally that our method outperforms other weak supervision methods on various text- and image-classification tasks.

artificial intelligence, machine learning, weak signal, (15 more...)

arXiv.org Artificial Intelligence

2009.0736

Country:

North America > United States > Virginia (0.04)
Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)

Genre: Research Report (0.82)

Industry:

Media (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)