Parallelism


PyTorch 1.6 Released, Microsoft To Take Care Of The Windows Version of PyTorch

#artificialintelligence

Recently, Facebook announced the availability of the latest version of PyTorch, PyTorch 1.6. The social media giant also made a major announcement that Microsoft has expanded its participation in the PyTorch community and is taking ownership of the development and maintenance of the PyTorch build for Windows. PyTorch is one of the most popular machine learning libraries in Python. The 1.6 release includes several new APIs, tools for performance improvement and profiling, as well as significant updates to both distributed data-parallel (DDP) and remote procedure call (RPC) based distributed training. According to the blog post, from this release onward features will be classified as Stable, Beta and Prototype, where Prototype features are not included in the binary distribution and are instead available by building from source, using nightlies, or via a compiler flag. Automatic mixed precision (AMP) training is now natively supported and is a stable feature.
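For reference, here is a minimal sketch of how the now-stable native AMP API is typically used. The toy model, data, and hyperparameters are illustrative only, and a CUDA-capable GPU is assumed:

```python
import torch
import torch.nn as nn

# Toy setup; torch.cuda.amp requires a CUDA-capable GPU.
device = "cuda"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients don't underflow

for step in range(100):
    x = torch.randn(64, 128, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in mixed precision
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then takes the optimizer step
    scaler.update()                   # adjusts the loss scale for the next iteration
```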


A Guide to Production Level Deep Learning

#artificialintelligence

Deploying deep learning models in production can be challenging, as it involves far more than training a model with good performance. This post aims to be an engineering guideline for building production-level deep learning systems to be deployed in real-world applications. The material presented here is borrowed from the Full Stack Deep Learning Bootcamp (by Pieter Abbeel at UC Berkeley, Josh Tobin at OpenAI, and Sergey Karayev at Turnitin), the TFX workshop by Robert Crowe, and material from Pipeline.ai. Fun fact: 85% of AI projects fail. In the following, we will go through each module and recommend toolsets and frameworks, as well as best practices from practitioners, that fit each component.


The Open Source Technologies Behind One of the Biggest Language Models in History

#artificialintelligence

Transformers and pre-trained models can be considered among the most important developments in recent years of deep learning. Beyond the research breakthroughs, Transformers have redefined the natural language understanding (NLU) space, sparking a race among leading AI vendors to build bigger and more efficient neural networks. The Transformer architecture has been behind famous models such as Google's BERT, Facebook's RoBERTa and OpenAI's GPT-3. It is not surprising that many people believe only big companies have the resources to tackle the implementation of Transformer models. Earlier this year, the deep learning community was astonished when Microsoft Research unveiled the Turing Natural Language Generation (T-NLG) model, which, at the time, was considered the largest natural language processing (NLP) model in the history of artificial intelligence (AI), with 17 billion parameters.


Accelerating Linear Models for Machine Learning

#artificialintelligence

If you have ever used Python and scikit-learn to build machine learning (ML) models from large data sets, you may have also wished that you could make these computations go faster. What if I told you that altering a single line of code could accelerate your ML computations? What if I also told you that getting faster results doesn't require specialized hardware? In this article, I will show you how to train ridge regression models using a version of scikit-learn that is optimized for Intel CPUs, and then compare the performance and accuracy of these models against models trained with the vanilla scikit-learn library. This article continues our series on accelerated ML algorithms.
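As a sketch of the "single line" idea: Intel's optimized scikit-learn extension (distributed today as the scikit-learn-intelex package; the article may rely on an earlier daal4py-based mechanism) provides a patch_sklearn() call that reroutes supported estimators such as Ridge to Intel-optimized kernels. The synthetic data and hyperparameters below are illustrative only:

```python
# Assumes `pip install scikit-learn-intelex`; treat this as a sketch, not the article's exact code.
from sklearnex import patch_sklearn
patch_sklearn()  # the "single line": swaps in Intel-optimized implementations

# Import scikit-learn estimators *after* patching so Ridge picks up the optimized kernel.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 50))
y = X @ rng.standard_normal(50) + 0.1 * rng.standard_normal(100_000)

model = Ridge(alpha=1.0).fit(X, y)
print("R^2 on training data:", model.score(X, y))
```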


Domain-Specific Hardware Accelerators

Communications of the ACM

From the simple embedded processor in your washing machine to powerful processors in data center servers, most computing today takes place on general-purpose programmable processors or CPUs. CPUs are attractive because they are easy to program and because large code bases exist for them. The programmability of CPUs stems from their execution of sequences of simple instructions, such as ADD or BRANCH; however, the energy required to fetch and interpret an instruction is 10x to 4000x more than that required to perform a simple operation such as ADD. This high overhead was acceptable when processor performance and efficiency were scaling according to Moore's Law [32]. One could simply wait and an existing application would run faster and more efficiently. Our economy has become dependent on these increases in computing performance and efficiency to enable new features and new applications. Today, Moore's Law has largely ended [12], and we must look to alternative architectures with lower overhead, such as domain-specific accelerators, to continue scaling of performance and efficiency. There are several ways to realize domain-specific accelerators, as discussed in the sidebar on accelerator options. A domain-specific accelerator is a hardware computing engine that is specialized for a particular domain of applications. Accelerators have been designed for graphics [26], deep learning [16], simulation [2], bioinformatics [49], image processing [38], and many other tasks. Accelerators can offer orders-of-magnitude improvements in performance/cost and performance/W compared to general-purpose computers. For example, our bioinformatics accelerator, Darwin [49], is up to 15,000x faster than a CPU at reference-based, long-read assembly. The performance and efficiency of accelerators are due to a combination of specialized operations, parallelism, efficient memory systems, and reduction of overhead. Domain-specific accelerators [7] are becoming more pervasive and more visible, because they are one of the few remaining ways to continue to improve performance and efficiency now that Moore's Law has ended [22]. Most applications require modifications to achieve high speedup on domain-specific accelerators. These applications are highly tuned to balance the performance of conventional processors with their memory systems.


Cpp-Taskflow v2: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

arXiv.org Artificial Intelligence

The Cpp-Taskflow project addresses the long-standing question: how can we make it easier for developers to write parallel and heterogeneous programs with both high performance and high productivity? Cpp-Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic task graph construction to incorporate a broad range of computational patterns, including hybrid CPU-GPU computing, dynamic control flow, and irregularity. We develop an efficient heterogeneous work-stealing strategy that adapts worker threads to the available task parallelism at any time during graph execution. We have demonstrated promising performance of Cpp-Taskflow on both micro-benchmarks and real-world applications. As an example, we solved a large machine learning workload up to 1.5x faster, with 1.6x less memory and 1.7x fewer lines of code, than two industrial-strength systems, oneTBB and StarPU, on a machine with 40 CPUs and 4 GPUs.
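Cpp-Taskflow itself is a C++ library, so the snippet below is only a language-neutral Python sketch of the static task-graph idea described above: each task is dispatched to a thread pool once all of its predecessors have finished. None of the names here come from Cpp-Taskflow's API, and the work-stealing scheduler mentioned in the abstract is not modeled:

```python
# A toy dependency-graph executor: tasks run as soon as their prerequisites finish.
from concurrent.futures import ThreadPoolExecutor
import threading

def run_graph(tasks, deps, max_workers=4):
    """tasks: name -> callable; deps: name -> set of prerequisite names (must form a DAG)."""
    remaining = {name: len(deps.get(name, ())) for name in tasks}   # unmet prerequisites
    dependents = {name: [] for name in tasks}                       # reverse edges
    for name, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(name)
    lock, all_done, finished = threading.Lock(), threading.Event(), [0]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        def run(name):
            tasks[name]()                          # execute the task body
            ready = []
            with lock:
                finished[0] += 1
                for d in dependents[name]:         # release tasks waiting on this one
                    remaining[d] -= 1
                    if remaining[d] == 0:
                        ready.append(d)
                if finished[0] == len(tasks):
                    all_done.set()
            for r in ready:
                pool.submit(run, r)                # dispatch newly ready tasks

        for name, count in remaining.items():      # start with tasks that have no prerequisites
            if count == 0:
                pool.submit(run, name)
        all_done.wait()

# Example: D depends on B and C, which both depend on A.
run_graph(
    tasks={n: (lambda n=n: print("run", n)) for n in "ABCD"},
    deps={"B": {"A"}, "C": {"A"}, "D": {"B", "C"}},
)
```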


TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism

arXiv.org Machine Learning

A good parallelization strategy can significantly improve the efficiency or reduce the cost of distributed training of deep neural networks (DNNs). Recently, several methods have been proposed to find efficient parallelization strategies, but they all optimize a single objective (e.g., execution time, memory consumption) and produce only one strategy. We propose FT, an efficient algorithm that searches for an optimal set of parallelization strategies to allow trade-offs among different objectives. FT can adapt to different scenarios by minimizing memory consumption when the number of devices is limited and fully utilizing additional resources to reduce execution time. For popular DNN models (e.g., vision, language), we conduct an in-depth analysis to understand the trade-offs among different objectives and their influence on the parallelization strategies. We also develop a user-friendly system, called TensorOpt, which allows users to run their distributed DNN training jobs without worrying about the details of parallelization strategies. Experimental results show that FT runs efficiently and provides accurate estimates of runtime costs, and that TensorOpt is more flexible in adapting to resource availability than existing frameworks.
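FT's distinguishing point is that it returns a set of strategies rather than a single one. As a toy, non-authoritative illustration of that idea (the strategy names and cost numbers below are invented and are unrelated to the paper's cost model), the Pareto frontier over estimated execution time and memory can be computed like this:

```python
# Invented candidate strategies with made-up (time, memory) estimates, for illustration only.
candidates = {
    "data-parallel":      {"time_s": 120.0, "mem_gb": 30.0},
    "tensor-parallel":    {"time_s": 150.0, "mem_gb": 18.0},
    "pipeline-parallel":  {"time_s": 140.0, "mem_gb": 16.0},
    "hybrid-data+tensor": {"time_s": 110.0, "mem_gb": 22.0},
    "recompute-heavy":    {"time_s": 200.0, "mem_gb": 10.0},
}

def pareto_front(cands):
    """Keep the strategies that no other strategy beats on both time and memory."""
    front = []
    for name, c in cands.items():
        dominated = any(
            o["time_s"] <= c["time_s"] and o["mem_gb"] <= c["mem_gb"]
            and (o["time_s"] < c["time_s"] or o["mem_gb"] < c["mem_gb"])
            for other, o in cands.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front

# A memory-constrained cluster would pick the low-memory end of the frontier;
# a cluster with spare devices would pick the fastest point instead.
print(pareto_front(candidates))
```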


Enabling Efficient and Flexible FPGA Virtualization for Deep Learning in the Cloud

arXiv.org Machine Learning

FPGAs have shown great potential in providing low-latency and energy-efficient solutions for deep neural network (DNN) inference applications. Currently, the majority of FPGA-based DNN accelerators in the cloud run in a time-division multiplexing mode for multiple users sharing a single FPGA, and require re-compilation with ~100 s of overhead. Such designs lead to poor isolation and heavy performance loss for multiple users, falling far short of efficient and flexible FPGA virtualization for either public or private cloud scenarios. To solve these problems, we introduce a novel virtualization framework for instruction set architecture (ISA)-based DNN accelerators sharing a single FPGA. We enable isolation by introducing a two-level instruction dispatch module and a multi-core-based hardware resource pool. These designs provide isolated and runtime-programmable hardware resources, which in turn provide performance isolation for multiple users. To overcome the heavy re-compilation overhead, we propose a tiling-based instruction frame package design and a two-stage static-dynamic compilation: only the lightweight runtime information is re-compiled, with ~1 ms of overhead, so performance is guaranteed for the private cloud. Our extensive experimental results show that the proposed virtualization design achieves 1.07-1.69x and 1.88-3.12x throughput improvements over previous static designs using the single-core and multi-core architectures, respectively.


Pipelined Backpropagation at Scale: Training Large Models without Batches

arXiv.org Machine Learning

Parallelism is crucial for accelerating the training of deep neural networks. Pipeline parallelism can provide an efficient alternative to traditional data parallelism by allowing workers to specialize. Performing mini-batch SGD using pipeline parallelism has the overhead of filling and draining the pipeline. Pipelined Backpropagation updates the model parameters without draining the pipeline. This removes the overhead but introduces stale gradients and inconsistency between the weights used on the forward and backward passes, reducing final accuracy and the stability of training. We introduce Spike Compensation and Linear Weight Prediction to mitigate these effects. Analysis on a convex quadratic shows that both methods effectively counteract staleness. We train multiple convolutional networks at a batch size of one, completely replacing batch parallelism with fine-grained pipeline parallelism. With our methods, Pipelined Backpropagation achieves full accuracy on CIFAR-10 and ImageNet without hyperparameter tuning.
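As a rough, non-authoritative sketch of the weight-prediction idea (a loose reading of the abstract, not the paper's algorithm), the toy below runs momentum SGD on a 1-D quadratic whose gradients are computed on weights that are D steps stale, with and without extrapolating the stale weights along the velocity before the gradient is taken:

```python
# Toy 1-D quadratic L(w) = 0.5 * a * w**2 trained with momentum SGD under a fixed
# gradient delay D, mimicking a pipeline that is never drained. "Prediction" evaluates
# the gradient at the stale weights extrapolated along the velocity; this is only a
# loose illustration of linear weight prediction, not the paper's exact formulation.
a, lr, beta, D, steps = 1.0, 0.05, 0.9, 4, 300

def run(predict):
    w, v = 5.0, 0.0
    stale = [w] * D                    # weights as a delayed pipeline stage would see them
    for _ in range(steps):
        w_seen = stale.pop(0)
        if predict:
            # A real stage would use its own (stale) velocity estimate; the current v is
            # used here only to keep the toy short.
            w_seen = w_seen + D * v
        g = a * w_seen                 # gradient of the quadratic at the weights actually used
        v = beta * v - lr * g
        w = w + v
        stale.append(w)
    return abs(w)

print("stale gradients only        :", run(predict=False))
print("with velocity extrapolation :", run(predict=True))
```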


Compiler Auto-Vectorization with Imitation Learning

Neural Information Processing Systems

Modern microprocessors are equipped with single instruction, multiple data (SIMD) or vector instruction sets, which allow compilers to exploit fine-grained data-level parallelism. To exploit this parallelism, compilers employ auto-vectorization techniques to automatically convert scalar code into vector code. Larsen & Amarasinghe (2000) first introduced superword-level parallelism (SLP) based vectorization, one form of vectorization popularly used by compilers. Current compilers employ hand-crafted heuristics and typically follow only one SLP vectorization strategy, which can be suboptimal. Recently, Mendis & Amarasinghe (2018) formulated the instruction packing problem of SLP vectorization using an integer linear programming (ILP) solver, achieving superior runtime performance.