 host server


vSphere 8 Expands Machine Learning Support: Device Groups for NVIDIA GPUs and NICs

#artificialintelligence

Data scientists and machine learning developers are building and training very large models that need ever more GPU memory. Many of these larger ML applications need more than one NVIDIA GPU on the vSphere server they run on, or they need to communicate between separate GPUs over the local network, whether to expand the overall GPU framebuffer memory capacity or for other reasons. Servers with eight or more physical GPUs are now on the market, and the number of GPUs per server will likely grow over time. With vSphere 8, you can add up to 8 virtual GPUs (vGPUs) to one VM.


Performance at Scale: Graphcore's Latest MLPerf Training Results

#artificialintelligence

Graphcore's latest submission to MLPerf demonstrates two things very clearly: our IPU systems are getting larger and more efficient, and our software maturity means they are also getting faster and easier to use. Software optimisation continues to deliver significant performance gains, with our IPU-POD16 now outperforming Nvidia's DGX A100 on the computer vision model ResNet-50. Training ResNet-50 takes 28.3 minutes on the IPU-POD16, compared to 29.1 minutes for DGX A100, a performance improvement of 24% since our first submission through software alone. It is a significant milestone, given that ResNet-50 has traditionally been a showpiece model for GPUs. Our software-driven performance gain for ResNet-50 on the IPU-POD64 was even greater at 41%.
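The excerpt quotes two different kinds of figure: the 24% is a gain over Graphcore's own earlier submission, while the two training times compare the IPU-POD16 with the DGX A100. As a rough illustration of the second comparison only, the sketch below computes the wall-clock saving implied by the quoted times. The function name is made up for this example, and nothing here comes from MLPerf tooling.

```python
# Training times quoted in the excerpt, in minutes.
ipu_pod16_minutes = 28.3
dgx_a100_minutes = 29.1


def relative_time_saving(faster: float, slower: float) -> float:
    """Fraction of the slower system's wall-clock time that the faster one saves.

    Hypothetical helper for illustration; not part of any benchmark suite.
    """
    return (slower - faster) / slower


saving = relative_time_saving(ipu_pod16_minutes, dgx_a100_minutes)
print(f"IPU-POD16 finishes {saving:.1%} sooner than DGX A100")
```

On these numbers the gap between the two systems is a few percent, which is much smaller than the 24% generation-over-generation software gain the excerpt reports.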


How Virtual GPUs Enhance Sharing in Kubernetes for Machine Learning on VMware vSphere

#artificialintelligence

This optimizes the use of the GPU hardware and it can serve more than one user, reducing costs. A basic level of familiarity with the core concepts in Kubernetes and in GPU acceleration will be useful to the reader of this article. We first look more closely at pods in Kubernetes and how they relate to a GPU. A pod is the unit of deployment, at the lowest level, in Kubernetes. A pod can have one or more containers within it. The lifetimes of the containers within a pod tend to be about the same, although one container may start before the others, as the "init" container. You can deploy higher-level objects like Kubernetes services and deployments that have many pods in them. We focus on pods and their use of GPUs in this article. Given access rights to a Tanzu Kubernetes cluster (TKC) running on the VMware vSphere with Tanzu environment (i.e. a set of host servers running the ESXi hypervisor, managed by VMware vCenter), a user can issue the command:


Distributed Machine Learning on VMware vSphere with GPUs and Kubernetes: a Webinar - Virtualize Applications

#artificialintelligence

This article directs you to a recent webinar that VMware produced on the topic of executing distributed machine learning with TensorFlow and Horovod running on a set of VMs on multiple vSphere host servers. Many machine learning problems are tackled using a single host server today (with a collection of VMs on that host). However, when your ML model or data grows too large for one host to handle, or your GPU power happens to be dispersed across several physical host servers/VMs, then distribution is the mechanism used to tackle that scenario. The VMware webinar introduces the concepts of machine learning in general first. It then gives a short description of Horovod for distributed training and explains the importance of low latency networking between the nodes in the distributed model, based here on Mellanox RDMA over Converged Ethernet (RoCE) technology.
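To make the role of the network concrete: in data-parallel training of the kind Horovod supports, each worker computes gradients on its own batch, and the workers then combine them (an allreduce with averaging) before every weight update. The sketch below simulates only that averaging step in plain Python. It does not use the Horovod API, the function name is hypothetical, and the gradient values are made up.

```python
from typing import List


def allreduce_average(worker_grads: List[List[float]]) -> List[float]:
    """Average gradients element-wise across workers.

    Conceptual stand-in for the result of an averaging allreduce;
    a real framework performs this exchange over the network.
    """
    n_workers = len(worker_grads)
    # zip(*...) groups the i-th gradient component from every worker.
    return [sum(components) / n_workers for components in zip(*worker_grads)]


# Made-up gradients from three workers, two parameters each.
grads = [
    [0.2, -0.4],
    [0.4, 0.0],
    [0.0, 0.1],
]
averaged = allreduce_average(grads)
print(averaged)
```

Because an exchange like this happens on every training step, its cost depends directly on the latency and bandwidth between nodes, which is why the excerpt stresses low-latency networking.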