stfc
Optimizing AI and Deep Learning Performance
As AI and deep learning use skyrockets, organizations are finding that they run these systems on much the same resources as their high-performance computing (HPC) systems, and are wondering whether this is the path to peak efficiency. On the surface, AI and HPC architectures have a lot in common, especially as AI has evolved into the even more data-intensive machine learning (ML) and deep learning (DL) domains (Figure 1). First, workloads often require multiple GPU systems operating as a cluster, with those systems shared in a coordinated way among multiple data scientists. Second, both AI and HPC workloads require high-performance shared access to data and communicate over a fast RDMA-enabled network. In scientific research especially, classic HPC systems now tend to have GPUs added to their compute nodes so that the same cluster suits both classic HPC and new AI/DL workloads.
STFC Machine Learning Group Deploys Elastic NVMe Storage to Power GPU Servers - insideHPC
At SC19, Excelero announced that the Science and Technology Facilities Council (STFC) has deployed a new HPC architecture to support computationally intensive analysis, including machine learning and AI-based workloads, using the NVMesh elastic NVMe block storage solution. Carried out in partnership with Boston Limited, the deployment enables researchers from STFC and the Alan Turing Institute to complete machine learning training tasks that formerly took three to four days in just one hour, and to run other foundational scientific computations that they previously could not perform. The Science and Technology Facilities Council is part of U.K. Research and Innovation (UKRI) and supports pioneering scientific and engineering research by over 1,700 academic researchers worldwide on space, materials and life sciences, nuclear physics and much more. In benchmark testing we quickly saw that our Flash-IO Talyn systems with the Excelero NVMesh software delivered a significant performance ...