

How Virtual GPUs Enhance Sharing in Kubernetes for Machine Learning on VMware vSphere

#artificialintelligence

Sharing a virtual GPU in this way optimizes the use of the GPU hardware, allowing it to serve more than one user and reducing costs. A basic level of familiarity with the core concepts of Kubernetes and GPU acceleration will be useful to the reader of this article. We first look more closely at pods in Kubernetes and how they relate to a GPU. A pod is the lowest-level unit of deployment in Kubernetes. A pod can have one or more containers within it. The lifetimes of the containers within a pod tend to be about the same, although one container, the "init" container, may start before the others. You can deploy higher-level objects, such as Kubernetes services and deployments, that contain many pods. We focus on pods and their use of GPUs in this article. Given access rights to a Tanzu Kubernetes cluster (TKC) running on the VMware vSphere with Tanzu environment (i.e. a set of host servers running the ESXi hypervisor, managed by VMware vCenter), a user can issue commands to deploy pods that consume GPUs.
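As a hedged sketch of what such a deployment can look like – assuming the user's kubeconfig already points at the TKC context and that the NVIDIA device plugin exposes the "nvidia.com/gpu" resource, with the pod name and container image below being purely illustrative rather than taken from the article – the official Kubernetes Python client can create a pod that claims one GPU:

```python
# Hedged sketch: create a pod that requests one GPU on a Tanzu Kubernetes
# cluster. Assumes the kubeconfig already points at the TKC context and that
# the NVIDIA device plugin exposes the "nvidia.com/gpu" resource; the pod
# name and image are illustrative.
from kubernetes import client, config

config.load_kube_config()  # pick up the current (TKC) kubeconfig context

gpu_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-test-pod"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda-container",
                image="nvcr.io/nvidia/cuda:11.0-base",
                command=["nvidia-smi"],
                # The GPU resource limit is what ties this pod to a node
                # (VM) with a GPU or vGPU attached.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=gpu_pod)
```

The Kubernetes scheduler will only place such a pod on a node whose VM has a GPU or vGPU configured, which is the mechanism that makes the sharing described above possible.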


Determining GPU Memory for Machine Learning Applications on VMware vSphere with Tanzu

#artificialintelligence

VMware vSphere with Tanzu provides users with the ability to easily construct a Kubernetes cluster on demand for model development/test or deployment work in machine learning applications. These on-demand clusters are called Tanzu Kubernetes clusters (TKC), and their participating nodes, which are themselves VMs, can be sized as required using a YAML specification. In a TKC running on vSphere with Tanzu, each Kubernetes node is implemented as a virtual machine. Kubernetes pods are scheduled onto these nodes, or VMs, by the Kubernetes scheduler running in the cluster's Control Plane VMs. To accelerate machine learning training or inference code, one or more of these pods require a GPU or virtual GPU (vGPU) to be associated with them.
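As a rough, hedged illustration of inspecting how much memory a GPU or vGPU profile actually presents inside such a pod – assuming the NVIDIA driver is visible in the container and the nvidia-ml-py ("pynvml") package is installed, neither of which comes from the article itself:

```python
# Hedged sketch: query the memory of the (v)GPU visible inside a pod.
# Assumes the NVIDIA driver is exposed to the container and that the
# nvidia-ml-py package ("pynvml") is installed.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first visible (v)GPU
info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # sizes are in bytes
print(f"total: {info.total / 1024**2:.0f} MiB")
print(f"used:  {info.used / 1024**2:.0f} MiB")
print(f"free:  {info.free / 1024**2:.0f} MiB")
pynvml.nvmlShutdown()
```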


Nvidia adds container support into AI Enterprise suite

#artificialintelligence

Nvidia has rolled out the latest version of its AI Enterprise suite for GPU-accelerated workloads, adding integration with VMware's vSphere with Tanzu to enable organisations to run workloads both in containers and inside virtual machines. Available now, Nvidia AI Enterprise 1.1 is an updated release of the suite that Nvidia delivered last year in collaboration with VMware. It is essentially a collection of enterprise-grade AI tools and frameworks certified and supported by Nvidia to help organisations develop and operate a range of AI applications. That is, so long as those organisations are running VMware, which a great many enterprises still use to manage virtual machines across their environments, though many others do not. However, as noted by Gary Chen, research director for Software Defined Compute at IDC, deploying AI workloads is a complex task requiring orchestration across many layers of infrastructure.


Scaling Distributed Machine Learning leveraging vSphere, Bitfusion and NVIDIA GPU (Part 1 of 2) - Virtualize Applications

#artificialintelligence

Organizations are quickly embracing Artificial Intelligence (AI), Machine Learning and Deep Learning to open new opportunities and accelerate business growth. AI workloads, however, require massive compute power, which has led to the proliferation of GPU acceleration in addition to traditional CPU power. This has broken the traditional data center architecture and amplified organizational silos, poor utilization and lack of agility. While virtualization technologies have proven themselves in the enterprise with cost-effective, scalable and reliable IT computing, machine learning infrastructure has not evolved in the same way and is still bound to dedicated physical resources used to optimize and reduce training times. Bitfusion helps enterprises disaggregate GPU compute and dynamically attach GPUs anywhere in the datacenter, just like attaching storage.


Distributed Machine Learning on VMware vSphere with GPUs and Kubernetes: a Webinar - Virtualize Applications

#artificialintelligence

This article directs you to a recent webinar that VMware produced on the topic of executing distributed machine learning with TensorFlow and Horovod running on a set of VMs across multiple vSphere host servers. Many machine learning problems are tackled using a single host server today (with a collection of VMs on that host). However, when your ML model or data grows too large for one host to handle, or your GPU power is dispersed across several physical host servers/VMs, distribution is the mechanism used to tackle that scenario. The VMware webinar first introduces the concepts of machine learning in general. It then gives a short description of Horovod for distributed training and explains the importance of low-latency networking between the nodes in the distributed model, based here on Mellanox RDMA over Converged Ethernet (RoCE) technology.
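As a minimal sketch of the Horovod pattern the webinar covers – assuming TensorFlow 2 and Horovod are installed on every participating node, and using a deliberately small illustrative model rather than anything from the webinar itself:

```python
# Minimal sketch of Horovod-style distributed training with TensorFlow 2.
# Assumes horovod and tensorflow are installed on every node; launch with
# e.g. `horovodrun -np 4 python train.py`.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# Pin each worker process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Scale the learning rate by the number of workers, and wrap the optimizer
# so gradients are averaged across all nodes on every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype("float32") / 255.0

# Broadcast initial variables from rank 0 so all workers start in sync.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

The DistributedOptimizer wrapper exchanges gradients between all workers on every training step; that inter-node traffic is precisely what the low-latency RoCE networking discussed above accelerates.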


VMware's Project Magna applies machine learning to automate the data centre – Blocks and Files

#artificialintelligence

VMware is developing a cloud service to monitor software in customer deployments and tune it automatically to improve performance. This is Project Magna, and its first target is vSAN in hyperconverged infrastructure. It will work like this: customers select their key performance indicator – read optimisation, write optimisation, or both. Magna examines their vSAN environment and compares it to the KPI average across the deployments it stores and monitors. If the site is below average, Magna adjusts the configuration to bring it closer to the average.


ASX approaching artificial intelligence with caution – ZDNet

#artificialintelligence

While the Australian Securities Exchange (ASX) makes a global name for itself by implementing one of the only real use cases for distributed ledger technology (DLT) in its blockchain-based CHESS replacement project, its CIO Dan Chesterman has detailed a handful of other tech-related initiatives the "large regulated fintech" is also undertaking. Speaking with ZDNet at VMworld in San Francisco last week, Chesterman said his organisation is looking into the application of artificial intelligence (AI) and machine learning (ML), highlighting that in the ASX's context there are many examples where machines are making quite clever decisions. "The main exploration we've been doing of artificial intelligence in that AI/ML space has been in market announcements ... it's not in production, it's something we're doing as a proof of concept," he said. "And what we've come to the conclusion of is that we certainly see, in that sort of context, there is actually a serious consequence for any error." Market announcements, for example, are one area of the business where Chesterman said AI could both help and cause legal dilemmas.


Mellanox Powers Virtualized Machine Learning with VMware and NVIDIA - insideHPC

#artificialintelligence

Today Mellanox announced that its RDMA (Remote Direct Memory Access) networking solutions for VMware vSphere enable virtualized machine learning solutions that achieve higher GPU utilization and efficiency. The benchmark was performed on a four-node cluster running vSphere 6.7, equipped with NVIDIA T4 GPUs running vComputeServer (vCS) software and Mellanox ConnectX-5 100 GbE SmartNICs, all connected by a Mellanox Spectrum SN2700 100 GbE switch. The PVRDMA (paravirtual RDMA) Ethernet solution enables VM-to-VM communication over RDMA, which boosts data communication performance in virtualized environments while achieving significantly higher efficiency compared with legacy TCP/IP transports. Additionally, PVRDMA retains core virtual machine capabilities such as vMotion. This translates to real-world customer advantages, including optimized server and GPU utilization, reduced machine learning training time and improved scalability.


VMware and Nvidia partner to simplify virtualised GPUs

#artificialintelligence

Nvidia announced its new enterprise software product, vComputeServer, which has been developed and optimised for use with VMware's vSphere. Last week, VMware announced its intention to acquire Carbon Black and Pivotal in a massive deal that will expand the company's SaaS offerings while enhancing its ability to enable digital transformation for customers. Before the dust had even settled on that news, the company announced today (26 August) that it is set to launch a hybrid cloud on AWS (Amazon Web Services) in partnership with Nvidia, which will improve GPU (graphics processing unit) virtualisation. The two companies say that this is the first hybrid cloud service that lets enterprises accelerate AI, machine learning or deep learning workloads with GPUs. At the VMworld conference in San Francisco, Nvidia's VP of product management, John Fanelli, told reporters: "In a modern data centre, organisations are going to be using GPUs to power AI, deep learning and analytics. Due to the scale of those types of workloads, they're going to be doing some processing on premise in data centres, some processing in clouds and continually iterating between them." The company said that this will make the completion of deep learning training up to 50 times faster than with a CPU alone. This product is aimed at people who may be using Nvidia's Rapids software, Fanelli explained, which is a suite of data processing and machine learning libraries used for GPU acceleration in data science workflows. Nvidia founder and CEO Jensen Huang said: "From operational intelligence to artificial intelligence, businesses rely on GPU-accelerated computing to make fast, accurate predictions that directly impact their bottom line."
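For a sense of what RAPIDS usage looks like in practice – a hedged sketch assuming a CUDA-capable GPU and the cudf package installed (e.g. from the RAPIDS distribution channels), with purely illustrative data:

```python
# Hedged sketch of the RAPIDS pattern mentioned above: cuDF offers a
# pandas-like API whose operations execute on the GPU rather than the CPU.
# Assumes a CUDA-capable GPU and the cudf package installed.
import cudf

df = cudf.DataFrame({
    "user": ["a", "b", "a", "c"],
    "spend": [10.0, 20.0, 5.0, 7.5],
})
# The groupby/aggregate below runs on the GPU.
totals = df.groupby("user")["spend"].sum()
print(totals)
```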


Nvidia, VMware to Bring Virtual GPUs to VMware's AWS Cloud

#artificialintelligence

If you've ever found yourself wishing you could do all the things you've been able to do with a hypervisor and regular virtual machines but on a GPU cluster – in your own data center or in the cloud – Nvidia and VMware are now saying your wish is about to come true. Monday morning, in conjunction with the start of VMworld in San Francisco, the two companies announced that VMware Cloud on AWS, the VMware-operated cloud service running on bare-metal infrastructure in AWS data centers, will soon feature virtualized GPUs you'll be able to provision and manage using the same vSphere tools you use with regular VM infrastructure. You'll be able to share a single physical GPU among multiple VMs, but you'll also be able to aggregate the power of many GPUs to train a machine-learning model at massive scale, the companies said. The play here is to get VMware into the infrastructure mix for the emerging set of enterprise computing workloads that benefit from GPU acceleration, such as AI and machine learning, as well as more traditional Big Data analytics. Also on Monday, the company announced a broad strategy for tackling the hybrid cloud opportunity, which is essentially to provide a single set of tools for managing all enterprise infrastructure, on premises and/or in any public cloud, in a uniform way.