September 28, 2017 -- Cirrascale Cloud Services, a premier provider of multi-GPU deep learning cloud solutions, today announced it will begin offering NVIDIA Tesla V100 GPU accelerators as part of its dedicated, multi-GPU deep learning cloud service offerings. The Tesla V100 specifications are impressive: 16GB of HBM2 stacked memory, 5,120 CUDA cores and 640 Tensor Cores, providing 7.8 TFlops double-precision performance, 15.7 TFlops single-precision performance, and 125 TFlops mixed-precision deep learning performance. "Deploying the new NVIDIA Tesla V100 GPU accelerators within the Cirrascale Cloud Services platform will enable their customers to accelerate deep learning and HPC applications using the world's most advanced data center GPUs." To learn more about Cirrascale Cloud Services and its unique dedicated, multi-GPU cloud solutions, please visit http://www.cirrascale.cloud. Cirrascale Cloud Services, Cirrascale and the Cirrascale Cloud Services logo are trademarks or registered trademarks of Cirrascale Cloud Services LLC.
The major server vendors are lining up behind Nvidia's Tesla V100 GPU accelerator in a move that is expected to make artificial intelligence and machine learning workloads more mainstream. Dell EMC, HPE, IBM and Supermicro outlined servers built on Nvidia's latest GPU accelerators, which are based on the graphics chip maker's Volta architecture. That throughput effectively takes the speed limit off AI workloads. In a blog post, IBM's Brad McCredie, vice president of Big Blue's cognitive system development, noted that Nvidia, with the V100 as well as its NVLink, PCI-Express 4 and memory coherence technology, brings "unprecedented internal bandwidth" to AI-optimized systems.
Further research told me that along with an FPGA (Field-Programmable Gate Array), there's an embedded Intel Processor Graphics unit for deep learning inference. Unlike Microsoft's Project BrainWave (which relies only on Altera's Stratix 10 FPGA to accelerate deep learning inference), Intel's Inference Engine design uses integrated GPUs alongside FPGAs. Together, Intel's embedded Processor Graphics and Altera's Stratix 10 FPGA could be the top hardware products for deep learning inference acceleration. Marketing its embedded graphics processors to accelerate deep learning/artificial intelligence computing is one more reason for us to stay long INTC.
The graphics specialist has been applying its graphics processing units (GPUs) to train AI models, setting itself up to tap an AI chip market that could be worth $16 billion in 2022, according to MarketsandMarkets. The company launched its first-generation DRIVE PX platform two years ago, hoping to partner with automakers and develop self-driving cars. All of these partnerships have pushed NVIDIA's automotive revenue from just $56 million at the end of fiscal year 2015 to $140 million in the first quarter of fiscal 2018. NVIDIA saw this trend early and launched its Tesla GPU accelerators around five years ago for supercomputing applications.
NVIDIA today launched a partner program with the world's leading original design manufacturers (ODM) -- Foxconn, Inventec, Quanta and Wistron -- to more rapidly meet the demands for AI cloud computing. Through the NVIDIA HGX Partner Program, NVIDIA is providing each ODM with early access to the NVIDIA HGX reference architecture, NVIDIA GPU computing technologies and design guidelines. The standard HGX design architecture includes eight NVIDIA Tesla GPU accelerators in the SXM2 form factor and connected in a cube mesh using NVIDIA NVLink high-speed interconnects and optimized PCIe topologies. "Through this new partner program with NVIDIA, we will be able to more quickly serve the growing demands of our customers, many of whom manage some of the largest data centers in the world," said Taiyu Chou, general manager of Foxconn/Hon Hai Precision Ind Co., Ltd., and president of Ingrasys Technology Inc. "Early access to NVIDIA GPU technologies and design guidelines will help us more rapidly introduce innovative products for our customers' growing AI computing needs."
Although the keynote was heavy on artificial intelligence technologies like NVIDIA Isaac and NVIDIA Volta, Jensen also announced other technologies like GeForce GTX with Max-Q Design. Jensen Huang introduces Volta – the new Tesla V100 accelerator. And this is the NVIDIA HGX server with EIGHT NVIDIA Tesla V100 accelerators producing 960 TFLOPS! The NVIDIA Isaac Initiative will be built around NVIDIA Jetson 2 and the Isaac Robot Simulator, which allows the robots to train themselves in a virtual world.
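As a quick sanity check on the 960 TFLOPS figure, the aggregate can be split across the eight accelerators. This is only a back-of-the-envelope sketch in Python; the function name is made up for illustration, and the even split across GPUs is an assumption.

```python
# Figures quoted in the text: an HGX server with eight Tesla V100
# accelerators is described as producing 960 TFLOPS.
HGX_GPU_COUNT = 8
HGX_TOTAL_TFLOPS = 960.0


def per_gpu_tflops(total_tflops: float, gpu_count: int) -> float:
    """Split an aggregate throughput figure evenly across GPUs (assumed even)."""
    return total_tflops / gpu_count


if __name__ == "__main__":
    print(per_gpu_tflops(HGX_TOTAL_TFLOPS, HGX_GPU_COUNT))  # 120.0
```

The result, 120 TFLOPS per GPU, is close to the 125 TFlops mixed-precision figure quoted in the Cirrascale announcement. That suggests the 960 number is an aggregate peak deep learning figure and not a measured result.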
Nvidia HGX is a kind of starter recipe for original design manufacturers (ODMs) -- Foxconn, Inventec, Quanta, and Wistron -- to package GPUs in data center computers, said Ian Buck, general manager of accelerated computing at Nvidia, in an interview with VentureBeat. Using the recipe, ODMs can quickly design GPU-based systems for hyperscale data centers. As the overall demand for AI computing resources has risen sharply over the past year, so has the market adoption and performance of Nvidia's GPU computing platform. The standard HGX design includes eight Nvidia Tesla GPUs, connected in a mesh using Nvidia's NVLink high-speed interconnect system.
According to Nvidia, Tensor Cores can make the Tesla V100 up to 12x faster for deep learning applications compared to the company's previous Tesla P100 accelerator. In other words, chip companies are battling each other to improve Google's open sourced machine learning framework - a situation that can only benefit Google. A group of eight Tensor Cores in an SM perform a total of 1024 floating point operations per clock. A bigger threat to Nvidia may be other companies that develop and sell specialized machine learning chips with better performance/watt and cost metrics than a typical GPU can offer to other customers.
Architected to deliver higher performance, the Volta SM has lower instruction and cache latencies than past SM designs and includes new features to accelerate deep learning applications. Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. Tensor Cores provide up to 12x higher peak TFLOPS on Tesla V100 for deep learning training compared to P100 FP32 operations, and for deep learning inference, up to 6x higher peak TFLOPS compared to P100 FP16 operations. Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate, as Figure 8 shows) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock.
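The per-clock arithmetic above can be reproduced in a few lines. The Python sketch below is illustrative only: it assumes one FMA counts as two floating point operations (a multiply and an add), and it takes the 640 Tensor Core count and the 125 TFlops mixed-precision peak from the Cirrascale announcement. The implied clock is derived from those numbers and is not an official specification.

```python
# Figures quoted in this roundup.
FMA_PER_TENSOR_CORE_PER_CLOCK = 64  # mixed-precision FMAs per Tensor Core per clock
TENSOR_CORES_PER_SM = 8
TOTAL_TENSOR_CORES = 640            # Tesla V100
PEAK_TENSOR_TFLOPS = 125.0          # mixed-precision deep learning peak

# Assumption: one FMA (a * b + c) is counted as two floating point operations.
FLOPS_PER_FMA = 2


def flops_per_sm_per_clock() -> int:
    """Floating point operations all Tensor Cores in one SM perform per clock."""
    return FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA * TENSOR_CORES_PER_SM


def flops_per_gpu_per_clock() -> int:
    """Floating point operations all Tensor Cores on the GPU perform per clock."""
    return FMA_PER_TENSOR_CORE_PER_CLOCK * FLOPS_PER_FMA * TOTAL_TENSOR_CORES


def implied_clock_ghz() -> float:
    """Clock rate at which the per-clock figure would reach the quoted peak."""
    return PEAK_TENSOR_TFLOPS * 1e12 / flops_per_gpu_per_clock() / 1e9


if __name__ == "__main__":
    print(flops_per_sm_per_clock())       # 1024
    print(flops_per_gpu_per_clock())      # 81920
    print(round(implied_clock_ghz(), 2))  # 1.53
```

Under these assumptions, 64 FMAs x 2 x 8 Tensor Cores gives the 1024 operations per clock per SM stated in the text, and the quoted 125 TFlops peak corresponds to a clock of roughly 1.53 GHz.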