NVIDIA TensorRT is a high-performance deep learning inference library for production environments. Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided. TensorRT automatically optimizes trained neural networks for run-time performance, delivering up to 16x higher energy efficiency (performance per watt) on a Tesla P100 GPU compared to common CPU-only deep learning inference systems (see Figure 1). Figure 2 shows the performance of NVIDIA Tesla P100 and K80 GPUs running inference using TensorRT with the relatively complex GoogLeNet neural network architecture. In this post we will show you how to use TensorRT to get the best efficiency and performance out of your trained deep neural network on a GPU-based deployment platform.
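To make the "performance per watt" metric concrete: it is inference throughput divided by power draw, and an efficiency gain is the ratio of two such values. The sketch below is only an illustration. The function names are hypothetical, and the throughput and power numbers are placeholders chosen for easy arithmetic, not measurements from Figure 1 or Figure 2.

```python
def perf_per_watt(images_per_second: float, watts: float) -> float:
    """Energy efficiency as throughput divided by power draw (images/s/W)."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return images_per_second / watts


def efficiency_gain(candidate: float, baseline: float) -> float:
    """How many times more efficient the candidate is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline efficiency must be positive")
    return candidate / baseline


# Placeholder numbers for illustration only (NOT measured results).
gpu_efficiency = perf_per_watt(images_per_second=1000.0, watts=200.0)
cpu_efficiency = perf_per_watt(images_per_second=100.0, watts=100.0)
gain = efficiency_gain(gpu_efficiency, cpu_efficiency)
print(f"GPU: {gpu_efficiency} img/s/W, CPU: {cpu_efficiency} img/s/W, gain: {gain}x")
```

Comparing efficiency rather than raw throughput matters because a faster device that draws proportionally more power gives no cost advantage per inference.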
At GTC China yesterday, NVIDIA made a series of announcements. Some had to do with local partners and related achievements, such as powering the likes of Alibaba and Baidu. Partners of this magnitude are bound to generate impressive numbers and turn some heads. Another part of the announcements had to do with new hardware. NVIDIA unveiled Orin, a new system-on-a-chip (SoC) designed for autonomous vehicles and robots, as well as a new software-defined platform powered by the SoC, called NVIDIA DRIVE AGX Orin.
NVIDIA DALI (Data Loading Library) is an open-source library researchers can use to accelerate data pipelines by 15% or more. By running data augmentations on GPUs, NVIDIA DALI addresses performance bottlenecks in today's computer vision deep learning applications that include complex, multi-stage data augmentation steps. With DALI, deep learning researchers can scale training performance on image classification models such as ResNet-50 with MXNet, TensorFlow, and PyTorch across Amazon Web Services P3 8-GPU instances or DGX-1 systems with Volta GPUs. Framework users will have less code duplication thanks to consistent, high-performance data loading and augmentation across frameworks. To demonstrate its power, NVIDIA data scientists used it to fine-tune DGX-2 to achieve a record-breaking 15,000 images per second in training.
NVIDIA today introduced groundbreaking inference software that developers everywhere can use to deliver conversational AI applications, slashing inference latency that until now has impeded true, interactive engagement. NVIDIA TensorRT 7, the seventh generation of the company's inference software development kit, opens the door to smarter human-to-AI interactions, enabling real-time engagement with applications such as voice agents, chatbots and recommendation engines. Juniper Research estimates that 3.25 billion digital voice assistants are in use in devices around the world. By 2023, that number is expected to reach 8 billion, more than the world's total population. TensorRT 7 features a new deep learning compiler designed to automatically optimize and accelerate the increasingly complex recurrent and transformer-based neural networks needed for AI speech applications.
Welcome to this introduction to TensorRT, our platform for deep learning inference. You will learn how to deploy a deep learning application onto a GPU, increasing throughput and reducing latency during inference. TensorRT provides APIs and parsers to import trained models from all major deep learning frameworks. It then generates optimized runtime engines deployable in the datacenter as well as in automotive and embedded environments. Applications deployed on GPUs with TensorRT perform up to 40x faster than CPU-only platforms. This tutorial uses a C++ example to walk you through importing an ONNX model into TensorRT, applying optimizations, and generating a high-performance runtime engine for the datacenter environment.
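As a rough illustration of those three steps (import, optimize, build), here is a minimal sketch using TensorRT's Python bindings rather than the tutorial's own example code. It assumes a TensorRT 8.x or later installation with a supported GPU. The function name `build_engine_from_onnx`, the file paths, the 1 GiB workspace limit, and the optional FP16 flag are illustrative assumptions, not settings taken from the tutorial.

```python
from pathlib import Path

try:
    import tensorrt as trt  # assumption: TensorRT 8.x+ Python bindings installed
except ImportError:
    trt = None


def build_engine_from_onnx(onnx_path: str, engine_path: str,
                           workspace_bytes: int = 1 << 30,
                           fp16: bool = False) -> None:
    """Parse an ONNX model, build an optimized engine, and write it to disk."""
    if trt is None:
        raise RuntimeError("the tensorrt Python package is not installed")

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # The ONNX parser needs an explicit-batch network on TensorRT 8.x;
    # newer releases treat this flag as deprecated and always explicit.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # Step 1: import the trained model.
    parser = trt.OnnxParser(network, logger)
    if not parser.parse(Path(onnx_path).read_bytes()):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise ValueError("ONNX parse failed: " + "; ".join(errors))

    # Step 2: choose optimization settings (illustrative values).
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # allow reduced-precision kernels

    # Step 3: build and save the serialized runtime engine.
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT failed to build the engine")
    with open(engine_path, "wb") as f:
        f.write(serialized)
```

A call such as `build_engine_from_onnx("model.onnx", "model.plan")` (both hypothetical paths) would produce an engine file that can later be loaded with `trt.Runtime(logger).deserialize_cuda_engine(...)` for inference. Engines are specific to the GPU and TensorRT version they were built with, so the build usually runs on the deployment target.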