At SC19, Excelero announced that the Science and Technology Facilities Council (STFC) has deployed a new HPC architecture, built on the NVMesh elastic NVMe block storage solution, to support computationally intensive analysis including machine learning and AI-based workloads. Delivered in partnership with Boston Limited, the deployment enables researchers from STFC and the Alan Turing Institute to complete machine learning training tasks that formerly took three to four days in just one hour, as well as foundational scientific computations that researchers previously could not perform at all. STFC is part of U.K. Research and Innovation (UKRI) and supports pioneering scientific and engineering research by over 1,700 academic researchers worldwide in fields spanning space, materials and life sciences, nuclear physics, and much more. In benchmark testing we quickly saw that our Flash-IO Talyn systems with the Excelero NVMesh software delivered a significant performance ...
Given our focus on the systems level of AI machine building, storage was a big topic of discussion at the sold-out Next AI Platform event we hosted in May. It was difficult to cover where NVMe over fabrics and other trends fit into AI training systems in particular, so we asked distributed NVM-Express flash storage upstart Excelero, a pioneer in creating pools of flash storage that look and behave as if they were direct-attached storage for a server's applications, what it is about AI workloads that makes storage such a challenge. The answer, according to Josh Goldenhar, vice president of products at Excelero, is revealed in some basic feeds and speeds that show the imbalance in many machine learning systems, whether for training or inference. "The answer is pretty straightforward," explains Goldenhar. "If you look at the specs of the latest Nvidia DGX-2 and take the aggregate performance across the cards, the cards themselves can process directly from their memory at 14 TB/sec, which is an amazing number. But even with that amazingly huge number, we have to consider how the cards are hooked into NVLink and PCI-Express: all of that reaches the server over x16 links, and when you add it all up, it is actually around 256 GB/sec, which is a considerably lower number. But that is still so much more than has been put into the box."
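Goldenhar's two headline numbers can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming the public DGX-2 configuration of 16 Tesla V100 GPUs, roughly 900 GB/s of HBM2 bandwidth per GPU, and roughly 16 GB/s per PCIe 3.0 x16 link (all figures approximate and supplied here for illustration, not taken from the quote):

```python
# Back-of-the-envelope check of the bandwidth imbalance described above.
# Assumed figures: 16 V100 GPUs, ~900 GB/s HBM2 each, ~16 GB/s per PCIe 3.0 x16 link.

NUM_GPUS = 16
HBM2_BW_PER_GPU_GBS = 900   # GB/s of on-card memory bandwidth per GPU (approx.)
PCIE_X16_BW_GBS = 16        # GB/s per PCIe 3.0 x16 link to the host (approx.)

# Aggregate on-card memory bandwidth across all GPUs, in TB/s.
aggregate_hbm_tbs = NUM_GPUS * HBM2_BW_PER_GPU_GBS / 1000

# Aggregate host-side PCIe bandwidth across all GPUs, in GB/s.
aggregate_pcie_gbs = NUM_GPUS * PCIE_X16_BW_GBS

print(f"Aggregate GPU memory bandwidth: ~{aggregate_hbm_tbs:.1f} TB/s")
print(f"Aggregate PCIe bandwidth into the box: ~{aggregate_pcie_gbs} GB/s")
```

Under these assumptions the totals come out near the quoted 14 TB/sec and 256 GB/sec, a ratio of more than 50:1 between what the GPUs can consume from their own memory and what the host interconnect can feed them.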
Data centers that support AI and ML deployments rely on Graphics Processing Unit (GPU)-based servers to power their computationally intensive architectures. Across multiple industries, expanding GPU use is driving a projected CAGR of more than 31 percent for GPU servers through 2024. That means more system architects will be tasked with assuring top performance and cost-efficiency from GPU systems. Yet optimizing storage for these GPU-based AI/ML workloads is no small feat. GPU servers are highly efficient at the matrix multiplication and convolution required to train on large AI/ML datasets.
IDC performed in-depth interviews (IDIs) with these customers, all of whom will remain anonymous. Vertical markets represented by the interviewees include ecommerce/web hosting, financial services, cloud service providers, healthcare, and geospatial information services. This document discusses the business and technical drivers of the NVMe purchase decision, how they have deployed the technology, which workloads they are using it with, what their experiences with it have been, and what their future plans for the technology are. "NVMe is the future of primary enterprise storage, and many IT organizations -- both cloud service providers and enterprises -- are already integrating it into their environments for production workloads," said Eric Burgener, research vice president, Storage. "NVMe-based storage systems are already available from a number of start-ups as well as some established enterprise storage providers, and IDC expects that by 2021, NVMe-based arrays using NVMe over Fabric host connections will be driving more than 50% of all external primary storage revenue."