At SC19, Excelero announced that the Science and Technology Facilities Council (STFC) has deployed a new HPC architecture, built on the NVMesh elastic NVMe block storage solution, to support computationally intensive analysis including machine learning and AI-based workloads. The deployment, done in partnership with Boston Limited, is enabling researchers from STFC and the Alan Turing Institute to complete machine learning training tasks that formerly took three to four days in just one hour – and to perform foundational scientific computations that were previously out of reach. The Science and Technology Facilities Council, part of U.K. Research and Innovation (UKRI), supports pioneering scientific and engineering research by over 1,700 academic researchers worldwide in space, materials and life sciences, nuclear physics and much more. In benchmark testing we quickly saw that our Flash-IO Talyn systems with the Excelero NVMesh software delivered a significant performance ...
Given our focus on the systems level of AI machine building, storage was a big topic of discussion at the sold-out Next AI Platform event we hosted in May. We could hardly leave out where NVMe over fabrics and other trends fit into AI training systems in particular, so we asked distributed NVM-Express flash storage upstart Excelero, a pioneer in creating pools of flash storage that look and behave as if they were directly attached storage for a server's applications, what it is about AI workloads that makes storage such a challenge. The answer, according to Josh Goldenhar, vice president of products at Excelero, lies in some basic feeds and speeds that reveal the imbalance in many machine learning systems, whether for training or inference. "The answer is pretty straightforward," explains Goldenhar. "If you look at the specs of the latest Nvidia DGX-2 and take the aggregate performance across the cards, the cards themselves can read directly from their memory at 14 TB/sec, which is an amazing number. Even though that is an amazingly huge number, the cards are hooked into the server over NVLink and PCI-Express, all of it x16, and when you add all of that up, it is actually around 256 GB/sec, a considerably lower number. But that is still so much more than the storage that has been put into the box."
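Goldenhar's back-of-the-envelope numbers can be checked with a quick sketch. The figures below are assumptions for a DGX-2-class box (16 Tesla V100 GPUs at roughly 900 GB/sec of HBM2 bandwidth each, PCIe 3.0 x16 at roughly 16 GB/sec per slot), not figures stated in the interview:

```python
# Bandwidth imbalance in a DGX-2-class machine learning server.
# Assumed figures (not from the article): 16 V100 GPUs with ~900 GB/s
# of HBM2 bandwidth each; PCIe 3.0 x16 at ~16 GB/s per slot.

NUM_GPUS = 16
HBM2_BW_GBS = 900       # per-GPU memory bandwidth, GB/s
PCIE_X16_BW_GBS = 16    # PCIe 3.0 x16 link bandwidth, GB/s

aggregate_hbm = NUM_GPUS * HBM2_BW_GBS       # ~14,400 GB/s, i.e. ~14 TB/s
aggregate_pcie = NUM_GPUS * PCIE_X16_BW_GBS  # 256 GB/s

print(f"GPU memory bandwidth: {aggregate_hbm / 1000:.1f} TB/s")
print(f"Host link bandwidth:  {aggregate_pcie} GB/s")
print(f"Imbalance:            {aggregate_hbm / aggregate_pcie:.0f}x")
```

The roughly 56x gap between what the GPUs can chew through internally and what the host links can feed them exists before storage even enters the picture; a typical NVMe array delivers only a fraction of even the 256 GB/sec figure.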
Data centers that support AI and ML deployments rely on Graphics Processing Unit (GPU)-based servers to power their computationally intensive architectures. Across multiple industries, expanding GPU adoption is driving a projected CAGR of over 31 percent for GPU servers through 2024, which means more system architects will be tasked with ensuring top performance and cost-efficiency from GPU systems. Yet optimizing storage for these GPU-based AI/ML workloads is no small feat. GPU servers are highly efficient at the matrix multiplication and convolution operations required to train on large AI/ML datasets.
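One way to see why GPUs suit these kernels, and why keeping them fed becomes the hard part, is arithmetic intensity: a matrix multiply performs far more floating-point operations than it moves bytes. A minimal sketch (the function name and the assumption of naive square FP32 matrices are illustrative, not from the article):

```python
def matmul_arithmetic_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte moved for a naive n x n matrix multiply.

    Multiplying two n x n matrices costs 2*n**3 FLOPs (one multiply and
    one add per inner-product term), while the minimum traffic is the two
    operand matrices read plus one result written: 3*n**2 elements.
    """
    flops = 2 * n ** 3
    bytes_moved = 3 * n ** 2 * bytes_per_elem
    return flops / bytes_moved

# Intensity grows linearly with n: bigger matrices do more math per byte.
print(matmul_arithmetic_intensity(1024))  # ~170.7 FLOPs/byte
```

Because intensity grows linearly with matrix size, large training workloads keep the GPU's arithmetic units busy, which shifts the bottleneck to how quickly data can be staged from storage into GPU memory.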
Businesses are increasingly using data assets to sharpen their competitiveness and drive greater revenue, and part of that strategy is adopting machine learning and AI tools and technologies. But AI workloads have significantly different data storage and computing needs than generic workloads: they require huge amounts of data both to build and train the models and to keep them running. When it comes to storage for these workloads, high performance and long-term data retention are the most important concerns.