AITopics | adpsgd

Collaborating Authors

adpsgd

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

Xu, Jie, Zhang, Wei, Wang, Fei

arXiv.org Machine LearningAug-20-2020

As deep learning models are usually massive and complex, distributed learning is essential for increasing training efficiency. Moreover, in many real-world application scenarios like healthcare, distributed learning can also keep the data local and protect privacy. A popular distributed learning strategy is federated learning, where there is a central server storing the global model and a set of local computing nodes updating the model parameters with their corresponding data. The updated model parameters will be processed and transmitted to the central server, which leads to heavy communication costs. Recently, asynchronous decentralized distributed learning has been proposed and demonstrated to be a more efficient and practical strategy where there is no central server, so that each computing node only communicates with its neighbors. Although no raw data will be transmitted across different local nodes, there is still a risk of information leak during the communication process for malicious participants to make attacks. In this paper, we present a differentially private version of asynchronous decentralized parallel SGD (ADPSGD) framework, or A(DP)$^2$SGD for short, which maintains communication efficiency of ADPSGD and prevents the inference from malicious participants. Specifically, R{\'e}nyi differential privacy is used to provide tighter privacy analysis for our composite Gaussian mechanisms while the convergence rate is consistent with the non-private version. Theoretical analysis shows A(DP)$^2$SGD also converges at the optimal $\mathcal{O}(1/\sqrt{T})$ rate as SGD. Empirically, A(DP)$^2$SGD achieves comparable model accuracy as the differentially private version of Synchronous SGD (SSGD) but runs much faster than SSGD in heterogeneous computing environments.

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv.org Machine Learning

2008.09246

Country:

North America > Canada > Ontario > Toronto (0.14)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

Jiang, Peng, Agrawal, Gagan

arXiv.org Machine LearningJul-12-2020

Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational costs, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the performance of parallel SGD is still bottlenecked by the high communication costs even with a fast connection among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to perform communication every few iterations, using a constant averaging period. In this paper, we show that the optimal averaging period in terms of convergence and communication cost is not a constant, but instead varies over the course of the execution. Specifically, we observe that reducing the variance of model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations to reduce the initial large variance and synchronize less frequently in the later phase of the training process. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), to achieve a smaller overall variance of model parameters, and thus better convergence compared with the Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method with several image classification benchmarks and show that our ADPSGD indeed achieves smaller training losses and higher test accuracies with smaller communication compared with CPSGD. Compared with gradient-quantization SGD, we show that our algorithm achieves faster convergence with only half of the communication. Compared with full-communication SGD, our ADPSGD achieves 1:14x to 1:27x speedups with a 100Gbps connection among computing nodes, and the speedups increase to 1:46x ~ 1:95x with a 10Gbps connection.

adpsgd, artificial intelligence, machine learning, (19 more...)

arXiv.org Machine Learning

2007.06134

Country:

Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
(5 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

A Highly Efficient Distributed Deep Learning System For Automatic Speech Recognition

Zhang, Wei, Cui, Xiaodong, Finkler, Ulrich, Saon, George, Kayi, Abdullah, Buyuktosunoglu, Alper, Kingsbury, Brian, Kung, David, Picheny, Michael

arXiv.org Machine LearningJul-10-2019

Modern Automatic Speech Recognition (ASR) systems rely on distributed deep learning to for quick training completion. To enable efficient distributed training, it is imperative that the training algorithms can converge with a large mini-batch size. In this work, we discovered that Asynchronous Decentralized Parallel Stochastic Gradient Descent (ADPSGD) can work with much larger batch size than commonly used Synchronous SGD (SSGD) algorithm. On commonly used public SWB-300 and SWB-2000 ASR datasets, ADPSGD can converge with a batch size 3X as large as the one used in SSGD, thus enable training at a much larger scale. Further, we proposed a Hierarchical-ADPSGD (H-ADPSGD) system in which learners on the same computing node construct a super learner via a fast allreduce implementation, and super learners deploy ADPSGD algorithm among themselves. On a 64 Nvidia V100 GPU cluster connected via a 100Gb/s Ethernet network, our system is able to train SWB-2000 to reach a 7.6% WER on the Hub5-2000 Switchboard (SWB) test-set and a 13.2% WER on the Call-home (CH) test-set in 5.2 hours. To the best of our knowledge, this is the fastest ASR training system that attains this level of model accuracy for SWB-2000 task to be ever reported in the literature.

adpsgd, artificial intelligence, machine learning, (16 more...)

arXiv.org Machine Learning

1907.05701

Genre: Research Report (0.50)

Industry: Information Technology (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Distributed Deep Learning Strategies For Automatic Speech Recognition

Zhang, Wei, Cui, Xiaodong, Finkler, Ulrich, Kingsbury, Brian, Saon, George, Kung, David, Picheny, Michael

arXiv.org Machine LearningApr-9-2019

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour Switchboard (SWB2000), which is one of the most widely used datasets for ASR performance benchmark. We first investigate what are the proper hyper-parameters (e.g., learning rate) to enable the training with sufficiently large batch size without impairing the model accuracy. We then implement various distributed strategies, including Synchronous (SYNC), Asynchronous Decentralized Parallel SGD (ADPSGD) and the hybrid of the two HYBRID, to study their runtime/accuracy trade-off. We show that we can train the LSTM model using ADPSGD in 14 hours with 16 NVIDIA P100 GPUs to reach a 7.6% WER on the Hub5- 2000 Switchboard (SWB) test set and a 13.1% WER on the CallHome (CH) test set. Furthermore, we can train the model using HYBRID in 11.5 hours with 32 NVIDIA V100 GPUs without loss in accuracy.

artificial intelligence, batch size, machine learning, (17 more...)

arXiv.org Machine Learning

1904.04956

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report (0.50)

Industry: Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback