XLA




Operator Fusion in XLA: Analysis and Evaluation

Snider, Daniel, Liang, Ruofan

arXiv.org Artificial Intelligence

Machine learning (ML) compilers are an active area of research because they offer the potential to automatically speed up tensor programs. Kernel fusion is often cited as an important optimization performed by ML compilers. However, there exists a knowledge gap about how XLA, the most common ML compiler, applies this nuanced optimization, what kind of speedup it can afford, and what low-level effects it has on hardware. Our paper aims to bridge this knowledge gap by studying key compiler passes of XLA's source code. Our evaluation on the Cartpole reinforcement learning environment shows how different fusion decisions in XLA are made in practice. Furthermore, we implement several XLA kernel fusion strategies that can achieve up to 10.56x speedup compared to our baseline implementation.
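The fusion behavior the paper analyzes can be observed directly in XLA's optimized HLO. Below is a minimal sketch, not taken from the paper, that compiles a small elementwise computation with JAX (which lowers to XLA) and prints the optimized HLO; it assumes a recent JAX release where the lower()/compile()/as_text() staging API is available.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Three elementwise ops that XLA typically fuses into a single kernel,
    # avoiding intermediate round trips to GPU global memory.
    return jnp.tanh(x * 2.0 + 1.0)

x = jnp.ones((1024, 1024))
compiled = jax.jit(f).lower(x).compile()
print(compiled.as_text())  # look for "fusion" instructions in the optimized HLO
```

Setting the environment variable XLA_FLAGS=--xla_dump_to=/tmp/hlo before running achieves something similar for any XLA client by dumping the HLO modules before and after the fusion passes.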


Faster Text Generation with TensorFlow and XLA

#artificialintelligence

TL;DR: Text generation on transformers using TensorFlow can now be compiled with XLA. It is up to 100x faster than before, and even faster than PyTorch -- check the colab below! As the quality of large language models increased, so did our expectations of what those models could do. Especially since the release of OpenAI's GPT-2, models with text generation capabilities have been in the spotlight. And for legitimate reasons -- these models can be used to summarize, translate, and they have even demonstrated zero-shot learning capabilities on some language tasks.
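The pattern described in the article boils down to wrapping generate() in a tf.function with jit_compile=True and padding inputs to a fixed shape so XLA does not recompile for every new input length. A hedged sketch, assuming a TensorFlow GPT-2 checkpoint and a recent transformers release (the model name, padding length, and generation length are illustrative, not values from the article):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = TFAutoModelForCausalLM.from_pretrained("gpt2")

# Compile generation with XLA; the first call traces and compiles, later calls are fast.
xla_generate = tf.function(model.generate, jit_compile=True)

# Pad to a fixed length so the compiled program is reused across prompts.
inputs = tokenizer(["TensorFlow is"], return_tensors="tf",
                   padding="max_length", max_length=32)
outputs = xla_generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```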


Optimizing Data Collection in Deep Reinforcement Learning

Gleeson, James, Snider, Daniel, Yang, Yvonne, Gabel, Moshe, de Lara, Eyal, Pekhimenko, Gennady

arXiv.org Artificial Intelligence

Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU implementations of simulators induce high overhead when switching back and forth between GPU computations. We explore two optimizations that increase RL data collection efficiency by increasing GPU utilization: (1) GPU vectorization: parallelizing simulation on the GPU for increased hardware parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps to run in a single GPU kernel launch to reduce global memory bandwidth requirements. We find that GPU vectorization can achieve up to $1024\times$ speedup over commonly used CPU simulators. We profile the performance of different implementations and show that for a simple simulator, ML compiler implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch) by $13.4\times$ by reducing CPU overhead from repeated Python-to-DL-backend API calls. We show that simulator kernel fusion speedups with a simple simulator are $11.3\times$ and increase by up to $1024\times$ as simulator complexity increases in terms of memory bandwidth requirements. We show that the speedups from simulator kernel fusion are orthogonal and combinable with GPU vectorization, leading to a multiplicative speedup.
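The two optimizations can be sketched with an ML compiler front end such as JAX: vmap provides GPU vectorization across environments, and running several steps inside one jitted lax.scan keeps the whole rollout in a single compiled executable rather than returning to Python between steps. This is a minimal stand-in for the paper's simulator kernel fusion, not its actual implementation, and the toy physics below is assumed purely for illustration.

```python
import jax
import jax.numpy as jnp

NUM_STEPS = 64   # simulator steps chained inside one compiled call
N_ENVS = 1024    # environments simulated in parallel on the GPU

def step(state, _):
    # Toy update standing in for one Cartpole-style physics step.
    pos, vel = state
    vel = vel + 0.01 * jnp.sin(pos)
    pos = pos + 0.01 * vel
    return (pos, vel), None

@jax.jit
def rollout(states):
    # vmap vectorizes the step across all environments (GPU vectorization);
    # lax.scan chains NUM_STEPS updates without re-entering Python between steps.
    run = lambda s: jax.lax.scan(step, s, None, length=NUM_STEPS)[0]
    return jax.vmap(run)(states)

states = (jnp.zeros(N_ENVS), jnp.ones(N_ENVS))
final_pos, final_vel = rollout(states)
```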


Why You Should (or Shouldn't) be Using Google's JAX in 2022

#artificialintelligence

Since Google's JAX hit the scene in late 2018, it has been steadily growing in popularity, and for good reason. DeepMind announced in 2020 that it is using JAX to accelerate its research, and a growing number of publications and projects from Google Brain and others are using JAX. With all of this buzz, it seems like JAX is the next big Deep Learning framework, right? In this article we'll clarify what JAX is (and isn't), why you should care (or shouldn't, but you probably should), and whether you should (or shouldn't) use it. If you're already familiar with JAX and want to skip the benchmarks, you can jump ahead to our recommendations on when to use it. It may be best to start off with what JAX is not. JAX is not a Deep Learning framework or library, and it is not designed to ever be a Deep Learning framework or library in and of itself. In a sentence, JAX is a high-performance numerical computing library that incorporates composable function transformations [1]. This is the universal aspect of JAX that is relevant for any use case.
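"Composable function transformations" is concrete: grad, vmap, and jit are ordinary Python functions that wrap other functions and can be stacked. A minimal sketch (the loss function and array shapes are made up for illustration):

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Simple least-squares loss for a single example.
    return jnp.mean((x @ w - y) ** 2)

# Differentiate w.r.t. w, vectorize over the batch, then compile with XLA.
per_example_grads = jax.jit(jax.vmap(jax.grad(loss), in_axes=(None, 0, 0)))

w = jnp.zeros(3)
x = jnp.ones((8, 3))
y = jnp.ones(8)
print(per_example_grads(w, x, y).shape)  # (8, 3): one gradient per example
```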


tensorflow/mlir-hlo

#artificialintelligence

This implements a self-contained compiler for a set of linear algebra operations inspired by the XLA HLO IR, built using MLIR components. It is designed to provide an end-to-end flow independent of TensorFlow and XLA, but usable inside of these projects. Coding practice and conventions in this repository follow the MLIR Developer Guide, as part of the intent for this repo to act as an incubator for technology to upstream. These instructions work on Linux; you may have to adjust for your platform. Again, this is something to do every time you pull from this repository and the LLVM revision changes.


Julia at NIPS and the Future of Machine Learning Tools – Julia Computing

#artificialintelligence

We are excited to share several research papers on the Julia and Flux machine learning ecosystem, to be presented at the NIPS Systems for ML Workshop. Since initially proposing the need for a first-class language and ecosystem for machine learning (ML), we have made considerable progress, including the ability to take gradients of arbitrary computations by leveraging Julia's compiler, and compiling the resulting programs to specialized hardware such as Google's Tensor Processing Units. Here we talk about these papers and the projects that have brought them to life, namely Flux.jl [paper] and Zygote.jl. Flux.jl is a library that gives a fresh take on machine learning: it exposes powerful tools to the user in a non-intrusive manner while remaining completely hackable, right to its core. "Careful design of the underlying automatic differentiation allows freely mixing mathematical expressions, built-in and custom layers and algorithms with control flow in one model. This makes Flux unusually easy to extend to new problems."


Pushing the limits of GPU performance with XLA – TensorFlow – Medium

#artificialintelligence

XLA is a compiler for TensorFlow graphs that you can use to accelerate your TensorFlow ML models today with minimal source code changes. This post describes what XLA is and shows how you can try it out on your own code. TensorFlow 1.12 (with XLA) achieves significant performance gains over TF 1.11 (without XLA) on ResNet50 v1.0 training on NVIDIA Tesla V100 GPUs: 10,526 images/sec with synthetic data and 10,267 images/sec with real data (see appendix for reproduction instructions). We have observed speedups ranging from 1.13x to 3.04x on a variety of internal models. Normally when you run a TensorFlow graph, all of the operations are executed individually by the TensorFlow graph executor.
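In the TF 1.x setup the post describes, XLA auto-clustering is switched on through the session config; compatible ops are then grouped into clusters and compiled into fused kernels instead of being run one at a time by the graph executor. A minimal sketch using the TF 1.x API (in TF 2.x, tf.function(jit_compile=True) or tf.config.optimizer.set_jit(True) play a similar role):

```python
import tensorflow as tf  # TF 1.x-style API, matching the post's TF 1.12 era

config = tf.ConfigProto()
# Enable XLA auto-clustering for the whole session.
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

a = tf.random_normal([1024, 1024])
b = tf.nn.relu(tf.matmul(a, a) + 1.0)  # matmul + add + relu can be clustered and fused

with tf.Session(config=config) as sess:
    sess.run(b)
```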


Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs

Fischer, Keno, Saba, Elliot

arXiv.org Machine Learning

Google's Cloud TPUs are a promising new hardware architecture for machine learning workloads. They have powered many of Google's milestone machine learning achievements in recent years. Google has now made TPUs available for general use on their cloud platform and, as of very recently, has opened them up further to allow use by non-TensorFlow frontends. We describe a method and implementation for offloading suitable sections of Julia programs to TPUs via this new API and the Google XLA compiler. Our method is able to completely fuse the forward pass of a VGG19 model expressed as a Julia program into a single TPU executable to be offloaded to the device. Our method composes well with existing compiler-based automatic differentiation techniques on Julia code, and we are thus able to also automatically obtain the VGG19 backwards pass and similarly offload it to the TPU. Targeting TPUs using our compiler, we are able to evaluate the VGG19 forward pass on a batch of 100 images in 0.23s, which compares favorably to the 52.4s required for the original model on the CPU. Our implementation is less than 1000 lines of Julia, with no TPU-specific changes made to the core Julia compiler or any other Julia packages.