Intel enlisted one of the most enthusiastic users of deep learning and artificial intelligence to help out with the chip design. "We are thrilled to have Facebook in close collaboration sharing their technical insights as we bring this new generation of AI hardware to market," said Intel CEO Brian Krzanich in a statement. On top of social media, Intel is targeting healthcare, automotive and weather, among other applications. Unlike its PC chips, the Nervana NNP is an application-specific integrated circuit (ASIC) that's specially made for both training and executing deep learning algorithms. "The speed and computational efficiency of deep learning can be greatly advanced by ASICs that are customized for ... this workload," writes Intel's VP of AI, Naveen Rao.
Neural networks apply computational resources to solve machine learning linear algebra problems with very large matrices, iterating to make statistically accurate decisions. Most of the machine learning models in operation today started in academia, such as natural language or image recognition, and were further researched by large well-staffed research and engineering teams at Google, Facebook, IBM and Microsoft. Enterprise machine learning experts and data scientists will have to start from scratch with research and iterate to build new high-accuracy models. It is a specialty business because the enterprises need four characteristics not necessarily found together: a large corpus of data for training, highly skilled data scientists and machine learning experts, a strategic problem that machine learning can solve, and a reason not to use Google's or Amazon's pay-as-you-go offerings.
The major server vendors are lining up behind Nvidia's Tesla V100 GPU accelerator in a move that is expected to make artificial intelligence and machine learning workloads more mainstream. Dell EMC, HPE, IBM and Supermicro outlined servers built on Nvidia's latest GPU accelerators, which are based on the graphics chip maker's Volta architecture. That throughput effectively takes the speed limit off AI workloads. In a blog post, Brad McCredie, vice president of cognitive system development at IBM, noted that Nvidia's V100, together with NVLink, PCI-Express 4 and memory coherence technology, brings "unprecedented internal bandwidth" to AI-optimized systems.
System makers Fujitsu and Huawei Technologies reportedly are both planning to develop processors optimized for artificial intelligence workloads, moves that will put them into competition with the likes of Intel, Google, Nvidia and Advanced Micro Devices. Tech vendors are pushing hard to bring artificial intelligence (AI) and deep learning capabilities into their portfolios to meet the growing demand generated by a broad range of workloads, from data analytics to self-driving vehicles. Fujitsu engineers have been working for the past couple of years on what the company calls a deep learning unit (DLU), and last month the company gave more details on the component during the International Supercomputing show. The chip reportedly will include 16 deep learning processing elements, each of them housing eight single-instruction, multiple-data (SIMD) execution units.
One data center provider that specializes in hosting infrastructure for Deep Learning told us most of their customers hadn't yet deployed their AI applications in production. If your on-premises Deep Learning infrastructure will do a lot of training – the computationally intensive applications used to teach neural networks things like speech and image recognition – prepare for power-hungry servers with lots of GPUs on every motherboard. Inferencing servers are not particularly difficult to handle on-premises, but one big question the data center manager has to answer about them is how close they have to be to where input data originates. If your corporate data centers are in Ashburn, Virginia, but your Machine Learning application has to provide real-time suggestions to users in Dallas or Portland, chances are you'll need some inferencing servers in or near Dallas and Portland to make it actually feel close to real-time.
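As a rough illustration of why distance matters for inferencing, the sketch below estimates the round-trip propagation delay over optical fiber. The velocity factor (about two thirds of the speed of light) and the example distances are assumptions chosen for illustration, not measured routes between the cities named above. Real response times add routing, queuing and the inference computation itself on top of this floor.

```python
# Minimal sketch: lower bound on network round-trip time from distance alone.
# FIBER_VELOCITY_FACTOR and the example distances are illustrative assumptions.

SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum, km/s
FIBER_VELOCITY_FACTOR = 0.67        # assumed fraction of c for light in fiber


def round_trip_ms(distance_km, velocity_factor=FIBER_VELOCITY_FACTOR):
    """Return the round-trip propagation delay in milliseconds."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way_seconds * 1000


if __name__ == "__main__":
    # Hypothetical path lengths; actual fiber routes are longer than
    # straight-line distances between cities.
    for km in (50, 2000, 4000):
        print(f"{km:>5} km path: at least {round_trip_ms(km):.1f} ms round trip")
```

The point of the sketch is only that propagation delay grows linearly with distance, so a server a few thousand kilometers away spends tens of milliseconds on the wire before any inference work starts.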
Google has developed its second-generation tensor processor, four 45-teraflops chips packed onto a 180-teraflops tensor processing unit (TPU) module to be used for machine learning and artificial intelligence, and the company is bringing it to the cloud. Each card has its own high-speed interconnects, and 64 of the cards can be linked into what Google calls a pod, with 11.5 petaflops total; one petaflops is 10^15 floating point operations per second. The GPUs can typically also operate in double-precision mode (64-bit numbers) and half-precision mode (16-bit numbers). But as a couple of points of comparison: AMD's forthcoming Vega GPU should offer 13 TFLOPS of single-precision and 25 TFLOPS of half-precision performance, and the machine-learning accelerators that Nvidia announced recently (the Volta GPU-based Tesla V100) can offer 15 TFLOPS single precision and 120 TFLOPS for "deep learning" workloads.
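The module and pod figures quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers in the paragraph; the variable names are ours.

```python
# Figures quoted in the paragraph above.
CHIP_TFLOPS = 45        # per tensor processor chip
CHIPS_PER_MODULE = 4    # chips on one TPU module (card)
MODULES_PER_POD = 64    # modules linked into one pod

module_tflops = CHIP_TFLOPS * CHIPS_PER_MODULE   # 4 x 45 = 180 TFLOPS
pod_tflops = module_tflops * MODULES_PER_POD     # 64 x 180 = 11,520 TFLOPS
pod_pflops = pod_tflops / 1000                   # 11.52, quoted as 11.5 petaflops
pod_flops = pod_tflops * 10**12                  # operations per second

if __name__ == "__main__":
    print(f"module: {module_tflops} TFLOPS")
    print(f"pod:    {pod_pflops:.2f} petaflops = {pod_flops:.3e} FLOPS")
```

The quoted 11.5 petaflops is therefore the product 64 x 180 TFLOPS, rounded down slightly.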
Powered by NVIDIA Tesla P100 GPUs and NVIDIA's NVLink high-speed multi-GPU interconnect technology, the HGX-1 comes as AI workloads – from autonomous driving and personalized healthcare to superhuman voice recognition – are taking off in the cloud. Powered by eight NVIDIA Tesla P100 GPUs in each chassis, it features an innovative switching design – based on NVIDIA NVLink interconnect technology and the PCIe standard – enabling a CPU to dynamically connect to any number of GPUs. This allows cloud service providers that standardize on a single HGX-1 infrastructure to offer customers a range of CPU and GPU machine instance configurations to meet virtually any workload. The HGX-1 Hyperscale GPU Accelerator reference design is highly modular, allowing it to be configured in a variety of ways to optimize performance for different workloads.
Providing hyperscale data centers with a fast, flexible path for AI, the new HGX-1 hyperscale GPU accelerator is an open-source design released in conjunction with Microsoft's Project Olympus. It will enable cloud-service providers to easily adopt NVIDIA GPUs to meet surging demand for AI computing. NVIDIA is joining the Open Compute Project to help drive AI and innovation in the data center. Certain statements in this press release including, but not limited to, statements as to: the performance, impact and benefits of the HGX-1 hyperscale GPU accelerator; and NVIDIA joining the Open Compute Project are forward-looking statements that are subject to risks and uncertainties that could cause results to be materially different than expectations.
The Knights Landing Xeon Phi chips, which have been shipping in volume since June, deliver a peak performance of 3.46 teraflops at double precision and 6.92 teraflops at single precision, but do not support half precision math like the Pascal GPUs do. The Pascal chips, which run at 300 watts, would still deliver better performance per watt – specifically, 70.7 gigaflops per watt compared to the hypothetical Knights Mill chip based on Knights Landing we are talking about above, which would deliver 56 gigaflops per watt. The "Knights Corner" chip from 2013 was rated at slightly more than 2 teraflops single precision, and the Knights Landing chip from this year is rated at 6.92 teraflops single precision. Thus, we have a strong feeling that the chart above is not to scale, or that Intel showed half precision for the Knights Mill part and single precision for the Knights Corner and Knights Landing parts.
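The performance-per-watt comparison above is simple division, and a short sketch makes the relationships explicit. It uses only the figures quoted in the paragraph; the helper names are ours, and the "implied peak" is just the quoted efficiency multiplied back by the quoted power, not a separately sourced specification.

```python
def gflops_per_watt(peak_tflops, watts):
    """Efficiency in gigaflops per watt from peak teraflops and power draw."""
    return peak_tflops * 1000 / watts


def implied_peak_tflops(gf_per_watt, watts):
    """Peak teraflops implied by an efficiency figure and a power draw."""
    return gf_per_watt * watts / 1000


# Figures quoted in the paragraph above.
PASCAL_GF_PER_WATT = 70.7
PASCAL_WATTS = 300
KNIGHTS_MILL_GF_PER_WATT = 56.0   # hypothetical chip discussed in the text
KNL_DP_TFLOPS = 3.46
KNL_SP_TFLOPS = 6.92

pascal_implied_peak = implied_peak_tflops(PASCAL_GF_PER_WATT, PASCAL_WATTS)
efficiency_advantage = PASCAL_GF_PER_WATT / KNIGHTS_MILL_GF_PER_WATT
knl_sp_to_dp = KNL_SP_TFLOPS / KNL_DP_TFLOPS

if __name__ == "__main__":
    print(f"Pascal implied peak: {pascal_implied_peak:.2f} TFLOPS")
    print(f"Pascal efficiency advantage: {efficiency_advantage:.2f}x")
    print(f"Knights Landing single/double ratio: {knl_sp_to_dp:.1f}x")
```

On these numbers, 70.7 gigaflops per watt at 300 watts corresponds to roughly 21 teraflops of peak throughput, and Knights Landing's single precision rate is twice its double precision rate.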
That Tsubame 1.0 machine consisted of 655 of Sun's eight-socket Galaxy 4 server nodes equipped with two-core Opteron processors, with a then-massive 21.4 TB of aggregate main memory and a peak theoretical performance of 50.4 teraflops at double precision. That network provided 13.5 TB/sec of aggregate bandwidth and 3 Tb/sec of bisection bandwidth, and was linked to a 1 PB Lustre array based on Sun's "Thumper" storage and providing 50 GB/sec of storage I/O bandwidth. The nodes had three of Nvidia's Tesla M2050 accelerators, based on its "Fermi" generation of GPUs, and each node provided a total of 1.6 teraflops of compute, 400 GB/sec of memory bandwidth, and 80 Gb/sec of network bandwidth. With all of those CPUs and GPUs, Tsubame 3.0 will have 12.15 petaflops of peak double precision performance, and is rated at 24.3 petaflops single precision and, importantly, is rated at 47.2 petaflops at the half precision that is important for neural networks employed in deep learning applications.
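To see how the Tsubame 3.0 ratings scale with precision, the sketch below divides each quoted figure by the double precision rating. It uses only the numbers in the paragraph; the dictionary and variable names are ours.

```python
# Tsubame 3.0 peak ratings quoted in the paragraph above, in petaflops.
TSUBAME3_PFLOPS = {
    "double": 12.15,
    "single": 24.3,
    "half": 47.2,
}

# Speedup of each precision relative to double precision.
ratios = {
    precision: pflops / TSUBAME3_PFLOPS["double"]
    for precision, pflops in TSUBAME3_PFLOPS.items()
}

# What half precision would be if it were exactly 4x double precision.
ideal_half_pflops = 4 * TSUBAME3_PFLOPS["double"]

if __name__ == "__main__":
    for precision, ratio in ratios.items():
        print(f"{precision:>6}: {TSUBAME3_PFLOPS[precision]:6.2f} PF ({ratio:.2f}x double)")
    print(f"ideal 4x half precision would be {ideal_half_pflops:.1f} PF")
```

Single precision is exactly twice the double precision rating, while the quoted half precision figure is a little under four times it, so not every component of the machine doubles its throughput again at 16 bits.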