The 2 Types of Hardware Architectures for Efficient Training and Inference of Deep Neural Networks


Due to the popularity of deep neural networks (DNNs), many recent hardware platforms have special features that target DNN processing. The Intel Knights Mill CPU will feature special vector instructions for deep learning. The Nvidia Pascal GP100 GPU supports 16-bit floating-point (FP16) arithmetic and can perform two FP16 operations on a single-precision core for faster deep learning computation. Systems have also been built specifically for DNN processing, such as the Nvidia DGX-1 and Facebook's Big Basin custom DNN server. DNN inference has also been demonstrated on various embedded systems-on-chip (SoCs) such as Nvidia Tegra and Samsung Exynos, as well as on field-programmable gate arrays (FPGAs).
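To make the "two FP16 operations on a single-precision core" idea concrete, the sketch below shows only the data-layout side of it: two FP16 values occupy the same 32 bits as one single-precision value, so a 32-bit lane can carry a pair of them. This is a software emulation using Python's standard `struct` module, not a model of the GP100 hardware; the function names `pack_half2`, `unpack_half2`, and `add_half2` are hypothetical and chosen for illustration.

```python
import struct


def pack_half2(a: float, b: float) -> int:
    """Round two values to FP16 and pack them into one 32-bit word."""
    raw = struct.pack("<ee", a, b)  # 'e' is IEEE 754 binary16 (2 bytes each)
    return struct.unpack("<I", raw)[0]


def unpack_half2(word: int) -> tuple[float, float]:
    """Split a 32-bit word back into its two FP16 values."""
    return struct.unpack("<ee", struct.pack("<I", word))


def add_half2(x: int, y: int) -> int:
    """Element-wise add of two packed pairs, each result rounded to FP16.

    Emulated in software: real hardware would do both additions in one
    instruction on a 32-bit lane.
    """
    a0, a1 = unpack_half2(x)
    b0, b1 = unpack_half2(y)
    return pack_half2(a0 + b0, a1 + b1)
```

For example, `unpack_half2(add_half2(pack_half2(1.0, 2.0), pack_half2(0.5, 0.25)))` adds both pairs in a single call. Values outside the FP16 range would raise `OverflowError` in `struct.pack`, which this sketch does not handle.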

Introducing NVIDIA TITAN V: The World's Most Powerful PC Graphics Card


NVIDIA TITAN V is the most powerful graphics card ever created for the PC, driven by the world's most advanced architecture--NVIDIA Volta. NVIDIA's supercomputing GPU architecture is now here for your PC, and fueling breakthroughs in every industry.

Chips are down: Apple to stop using Intel processors in Macs, reports say

The Guardian

Apple is reportedly planning to drop Intel chips from its Mac computers as early as 2020, replacing them with processors designed in-house in the same way the company manufactures iPhones and iPads. The plan, reported by Bloomberg, has been rumoured for several years, as Apple has taken on more chip design for devices. The company's A-series of processors, currently capped by the A11 Bionic chips used in the iPhones 8, 8 Plus and X, are all designed by the company for specific purposes, and based on an architecture licensed from British firm ARM. Rumours of the switch contributed to a 6% drop in Intel's share price over the course of Monday, adding specific pain to a general collapse in tech stocks caused by fears of oncoming regulation in the wake of the Cambridge Analytica scandal at Facebook. Bloomberg reports that the initiative "is still in early developmental stages", but is intended to bring Macs into the same unified architecture that allows all of Apple's other devices to work together "seamlessly".

Nvidia unveils Turing architecture and GPUs with dedicated ray-tracing hardware


Nvidia has unveiled its new Turing architecture along with details of the first GPUs to use it. Turing includes dedicated "RT Core" hardware designed to drive ray tracing, a complex technique that can deliver extremely realistic lighting effects but has been prohibitively resource-intensive to render in real time. Nvidia calls the new Turing-based Quadro RTX the "world's first ray-tracing GPU" and claims it's the biggest leap since the company introduced CUDA in 2006. The Quadro RTX products are intended for high-end professional use, not gaming -- the flagship Quadro RTX 8000 will cost $10,000 when it ships toward the end of the year. For that, you get a GPU with 48GB of new GDDR6 memory, 4,608 CUDA cores, and 576 Tensor cores.

MAPLE: Microprocessor A Priori for Latency Estimation

Modern deep neural networks must demonstrate state-of-the-art accuracy while exhibiting low latency and energy consumption. As such, neural architecture search (NAS) algorithms take these two constraints into account when generating a new architecture. However, efficiency metrics such as latency are typically hardware dependent, requiring the NAS algorithm to either measure or predict the architecture latency. Measuring the latency of every evaluated architecture adds a significant amount of time to the NAS process. Here we propose Microprocessor A Priori for Latency Estimation (MAPLE), which does not rely on transfer learning or domain adaptation but instead generalizes to new hardware by incorporating prior hardware characteristics during training. MAPLE takes advantage of a novel quantitative strategy to characterize the underlying microprocessor by measuring relevant hardware performance metrics, yielding a fine-grained and expressive hardware descriptor. Moreover, MAPLE benefits from the tightly coupled I/O between the CPU and GPU, and their dependency, to predict DNN latency on GPUs while measuring microprocessor performance hardware counters from the CPU feeding the GPU hardware. With this quantitative strategy as the hardware descriptor, MAPLE can generalize to new hardware via a few-shot adaptation strategy: with as few as 3 samples it exhibits a 3% improvement over state-of-the-art methods that require as many as 10 samples. Experimental results showed that increasing the few-shot adaptation samples to 10 significantly improves accuracy over the state-of-the-art methods, by 12%. Furthermore, MAPLE exhibited 8-10% better accuracy, on average, than relevant baselines at any number of adaptation samples.
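The general shape of this idea, predicting latency from an architecture encoding combined with a hardware descriptor and then adapting with a few measurements from a new device, can be sketched as follows. This is a toy nearest-neighbour regressor, not the MAPLE model; the class `LatencyPredictor`, its methods, and the feature layout are all hypothetical, and the abstract does not specify how the descriptor or the adaptation step is implemented.

```python
import math

# (architecture encoding, hardware descriptor, measured latency in ms)
Sample = tuple[list[float], list[float], float]


class LatencyPredictor:
    """Toy k-nearest-neighbour latency regressor (illustrative only)."""

    def __init__(self, k: int = 1) -> None:
        self.k = k
        self.samples: list[Sample] = []

    def fit(self, samples: list[Sample]) -> None:
        """Store latency measurements from the training devices."""
        self.samples = list(samples)

    def adapt(self, few_shot_samples: list[Sample]) -> None:
        """Few-shot adaptation: add a handful of measurements from a new device."""
        self.samples.extend(few_shot_samples)

    def predict(self, arch: list[float], hw: list[float]) -> float:
        """Average the latency of the k stored samples closest to the query."""
        query = arch + hw  # concatenate architecture and hardware features
        nearest = sorted(
            self.samples,
            key=lambda s: math.dist(query, s[0] + s[1]),
        )[: self.k]
        return sum(s[2] for s in nearest) / len(nearest)
```

Because the hardware descriptor is part of the query, a prediction for an unseen device falls back on the most similar known device until `adapt` supplies measurements from the new one.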