Speeding Up A.I. - USC Viterbi School of Engineering


With a new three-year NSF grant, Ming Hsieh Department of Electrical and Computer Engineering researchers hope to solve the problem of scalable parallelism for AI. Co-PI's Professor Viktor Prasanna, Charles Lee Powell Chair in Electrical and Computer Engineering and Professor Xuehai Qian both from USC Viterbi, along with USC Viterbi alum and assistant professor at Northeastern University Yanzhi Wang, and USC Viterbi senior research associate Ajitesh Srivastava were awarded the $800,000 grant last month. Parallelism is the ability of an algorithm to perform several computations at the same time, rather than sequentially. For artificial intelligence challenges which require fast solutions, like the image processing related to autonomous vehicles, parallelism is an essential step to make these technologies practical to every-day life. Parallelism in neural networks has been explored, but the problem has been scaling it up to a point where it's applicable in time-critical/realtime tasks.

Compiler Auto-Vectorization with Imitation Learning

Neural Information Processing Systems

Modern microprocessors are equipped with single instruction multiple data (SIMD) or vector instruction sets which allow compilers to exploit fine-grained data level parallelism. To exploit this parallelism, compilers employ auto-vectorization techniques to automatically convert scalar code into vector code. Larsen & Amarasinghe (2000) first introduced superword level parallelism (SLP) based vectorization, which is one form of vectorization popularly used by compilers. Current compilers employ hand-crafted heuristics and typically only follow one SLP vectorization strategy which can be suboptimal. Recently, Mendis & Amarasinghe (2018) formulated the instruction packing problem of SLP vectorization by leveraging an integer linear programming (ILP) solver, achieving superior runtime performance.

Estimation, Optimization, and Parallelism when Data is Sparse

Neural Information Processing Systems

We study stochastic optimization problems when the \emph{data} is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. Our algorithms are adaptive: they achieve the best possible rate for the data observed. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on medium to large-scale learning tasks.

Ouroboros: On Accelerating Training of Transformer-Based Language Models

Neural Information Processing Systems

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models.

Affine Transformation- Image Processing In TensorFlow- Part 1


Affine Transformation helps to modify the geometric structure of the image, preserving parallelism of lines but not the lengths and angles. Affine Transformation helps to modify the geometric structure of the image, preserving parallelism of lines but not the lengths and angles. It preserves collinearity and ratios of distances. It is one type of method we can use in Machine Learning and Deep Learning for Image Processing and also for Image Augmentation. This technique is also used to correct Geometric Distortions and Deformations that occur with non-ideal camera angles.

Deep-Learning Framework SINGA Graduates to Top-Level Apache Project


The Apache Software Foundation (ASF) recently announced that SINGA, a framework for distributed deep-learning, has graduated to top-level project (TLP) status, signifying the project's maturity and stability. SINGA has already been adopted by companies in several sectors, including banking and healthcare. Originally developed at the National University of Singapore, SINGA joined ASF's incubator in March 2015. SINGA provides a framework for distributing the work of training deep-learning models across a cluster of machines, in order to reduce the time needed to train the model. In addition to its use as a platform for academic research, SINGA has been used in commercial applications by Citigroup and CBRE, as well as in several health-care applications, including an app to aid patients with pre-diabetes.



The Livermore Big Artificial Neural Network toolkit (LBANN) is an open-source, HPC-centric, deep learning training framework that is optimized to compose multiple levels of parallelism. LBANN provides model-parallel acceleration through domain decomposition to optimize for strong scaling of network training. It also allows for composition of model-parallelism with both data parallelism and ensemble training methods for training large neural networks with massive amounts of data. LBANN is able to advantage of tightly-coupled accelerators, low-latency high-bandwidth networking, and high-bandwidth parallel file systems. LBANN supports state-of-the-art training algorithms such as unsupervised, self-supervised, and adversarial (GAN) training methods in addition to traditional supervised learning.

The 4 Research Techniques to Train Deep Neural Network Models More Efficiently


Deep learning and unsupervised feature learning have shown great promise in many practical applications. State-of-the-art performance has been reported in several domains, ranging from speech recognition and image recognition to text processing and beyond. It's also been observed that increasing the scale of deep learning--with respect to numbers of training examples, model parameters, or both--can drastically improve accuracy. These results have led to a surge of interest in scaling up the training and inference algorithms used for these models and in improving optimization techniques for both. The use of GPUs is a significant advance in recent years that makes the training of modestly-sized deep networks practical.

Optimal Mini-Batch Size Selection for Fast Gradient Descent Machine Learning

Jerry Quinn IBM T.J. Watson Research Center Y orktown Heights, NY 10598 V alentina Salapura IBM T.J. Watson Research Center Y orktown Heights, NY 10598 Abstract This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By de-coupling algorithmic analysis issues from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning. We demonstrate the inverse law empirically, on both image recognition (MNIST, CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretic justification via proving a novel bound on mini-batch SGD training. Introduction In this paper, we present an empirical law, with theoretical justification, linking the number of learning iterations to the mini-batch size. From this result, we derive a principled methodology for selecting mini-batch size w.r.t. This methodology saves training time and provides both intuition and a principled approach for optimizing machine learning algorithms and machine learning hardware system design. Further, we use our methodology to show that focusing on weak scaling can lead to suboptimal training times because, by neglecting the dependence of convergence time on the size of the mini-batch used, weak scaling does not always minimize the training time.

HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow Artificial Intelligence

The enormous amount of data and computation required to train DNNs have led to the rise of various parallelization strategies. Broadly, there are two strategies: 1) Data-Parallelism -- replicating the DNN on multiple processes and training on different training samples, and 2) Model-Parallelism -- dividing elements of the DNN itself into partitions across different processes. While data-parallelism has been extensively studied and developed, model-parallelism has received less attention as it is non-trivial to split the model across processes. In this paper, we propose HyPar-Flow: a framework for scalable and user-transparent parallel training of very large DNNs (up to 5,000 layers). We exploit TensorFlow's Eager Execution features and Keras APIs for model definition and distribution. HyPar-Flow exposes a simple API to offer data, model, and hybrid (model + data) parallel training for models defined using the Keras API. Under the hood, we introduce MPI communication primitives like send and recv on layer boundaries for data exchange between model-partitions and allreduce for gradient exchange across model-replicas. Our proposed designs in HyPar-Flow offer up to 3.1x speedup over sequential training for ResNet-110 and up to 1.6x speedup over Horovod-based data-parallel training for ResNet-1001; a model that has 1,001 layers and 30 million parameters. We provide an in-depth performance characterization of the HyPar-Flow framework on multiple HPC systems with diverse CPU architectures including Intel Xeon(s) and AMD EPYC. HyPar-Flow provides 110x speed up on 128 nodes of the Stampede2 cluster at TACC for hybrid-parallel training of ResNet-1001.