Deep Learning Workloads


HammingMesh: A Network Topology for Large-Scale Deep Learning

Communications of the ACM

Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investigate data-movement characteristics of large-scale training at full system scale. Based on our workload analysis, we design HammingMesh, a novel network topology that provides high bandwidth at low cost with high job scheduling flexibility. Specifically, HammingMesh can provide full bandwidth and isolation for deep learning training jobs with two dimensions of parallelism. Furthermore, it also supports high global bandwidth for generic traffic.
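The "two dimensions of parallelism" mentioned in the abstract (typically data parallelism and pipeline parallelism) can be made concrete by mapping accelerator ranks onto a 2D grid. The sketch below is illustrative only; the grid shape, names, and layout are assumptions for exposition, not the HammingMesh construction from the paper.

```python
# Map a flat accelerator rank onto a 2D grid so that one axis carries
# data-parallel traffic (gradient allreduce) and the other carries
# pipeline-parallel traffic (point-to-point activations).

def rank_to_coords(rank, dp_size, pp_size):
    """Return (data_parallel_index, pipeline_stage) for a rank."""
    assert 0 <= rank < dp_size * pp_size
    return rank // pp_size, rank % pp_size

def allreduce_group(rank, dp_size, pp_size):
    """Ranks that hold the same pipeline stage: they allreduce gradients."""
    _, stage = rank_to_coords(rank, dp_size, pp_size)
    return [d * pp_size + stage for d in range(dp_size)]

def pipeline_neighbors(rank, dp_size, pp_size):
    """Previous/next pipeline stage within the same data-parallel replica."""
    dp, stage = rank_to_coords(rank, dp_size, pp_size)
    prev = dp * pp_size + stage - 1 if stage > 0 else None
    nxt = dp * pp_size + stage + 1 if stage < pp_size - 1 else None
    return prev, nxt

# 8 accelerators arranged as 4 data-parallel replicas x 2 pipeline stages
print(rank_to_coords(5, 4, 2))      # (2, 1)
print(allreduce_group(5, 4, 2))     # [1, 3, 5, 7]
print(pipeline_neighbors(5, 4, 2))  # (4, None)
```

A topology can then provide high bandwidth along each grid axis while isolating jobs that occupy disjoint sub-grids.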


DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads

Zhao, Qidong, Wu, Hao, Hao, Yuming, Ye, Zilingfeng, Li, Jiajia, Liu, Xu, Zhou, Keren

arXiv.org Artificial Intelligence

Effective performance profiling and analysis are essential for optimizing training and inference of deep learning models, especially given the growing complexity of heterogeneous computing environments. However, existing tools often lack the capability to provide comprehensive program context information and performance optimization insights for sophisticated interactions between CPUs and GPUs. This paper introduces DeepContext, a novel profiler that links program contexts across high-level Python code, deep learning frameworks, underlying libraries written in C/C++, and device code executed on GPUs. DeepContext incorporates measurements of both coarse- and fine-grained performance metrics for major deep learning frameworks, such as PyTorch and JAX, and is compatible with GPUs from both Nvidia and AMD, as well as various CPU architectures, including x86 and ARM. In addition, DeepContext integrates a novel GUI that allows users to quickly identify hotspots, and an automated performance analyzer that suggests potential optimizations based on performance metrics and program context. Through detailed use cases, we demonstrate how DeepContext can help users identify and analyze performance issues to enable quick and effective optimization of deep learning workloads. We believe DeepContext is a valuable tool for users seeking to optimize complex deep learning workflows across multiple compute environments.
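The Python-level slice of the program context that profilers like DeepContext capture can be illustrated with the standard library's cProfile. This is a generic sketch of hotspot attribution by call site, not DeepContext's actual API; the function names are invented for illustration.

```python
import cProfile
import io
import pstats

def dense_layer(x, w):
    # Naive matrix-vector product standing in for a model's hot loop.
    return [sum(xi * wij for xi, wij in zip(x, row)) for row in w]

def train_step():
    x = [1.0] * 256
    w = [[0.5] * 256 for _ in range(256)]
    for _ in range(20):
        dense_layer(x, w)

profiler = cProfile.Profile()
profiler.enable()
train_step()
profiler.disable()

# Report cumulative time per call site -- the Python "program context".
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print("dense_layer" in report)  # the hotspot shows up by function name
```

Tools like DeepContext extend this idea by stitching such Python call paths together with framework internals, C/C++ library frames, and GPU kernel launches.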


Microsoft's 'Singularity' to Enable Global Accelerator Network for AI Training

#artificialintelligence

In science fiction and future studies, the word "singularity" is invoked in reference to a rapidly snowballing artificial intelligence that, repeatedly iterating on itself, eclipses all human knowledge and ability. It is this word that Microsoft--perhaps ambitiously--has invoked for its new AI project, a "globally distributed scheduling service for highly efficient and reliable execution of deep learning training and inference workloads." Microsoft's Singularity is a response to the computational costs of training deep learning workloads--costs that have quickly spiraled as those workloads have grown in size, complexity and number. It is also an attempt to maximize the use of idle time, which has increasingly become a focus of discussions of how to minimize the costs and environmental footprints of high-performance computing systems and AI model training on such systems. "Singularity is built with one key goal," explains the preprint paper, which was written by a team of more than two dozen Microsoft researchers and published on arXiv, "driving down the cost of AI by maximizing the aggregate useful throughput on a given fixed pool of capacity of accelerators on a planet scale, while providing stringent [service-level agreements] for multiple pricing tiers."


HPTMT Parallel Operators for High Performance Data Science & Data Engineering

Abeykoon, Vibhatha, Kamburugamuve, Supun, Widanage, Chathura, Perera, Niranda, Uyar, Ahmet, Kanewala, Thejaka Amila, von Laszewski, Gregor, Fox, Geoffrey

arXiv.org Artificial Intelligence

Data-intensive applications are becoming commonplace in all science disciplines. They comprise a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators suited to the applications of different domains. The lack of a clear definition of data structures and operators in the field has often led to implementations that do not work well together. The HPTMT architecture that we proposed recently identifies a set of data structures, operators, and an execution model for creating rich data applications that link all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together.


Cloud GPU Instances: What Are the Options? - DATAVERSITY

#artificialintelligence

If you're running demanding machine learning and deep learning models on your laptop or on GPU-equipped machines owned by your organization, there is a new and compelling alternative. All major cloud providers offer cloud GPUs – compute instances with powerful hardware acceleration, which you can rent per hour, letting you run deep learning workloads on the cloud. Let's review the concept of cloud GPUs and the offerings by the big three cloud providers – Amazon, Azure, and Google Cloud. A cloud graphics processing unit (GPU) provides hardware acceleration for an application, without requiring that a GPU be deployed on the user's local device.


Value Function Based Performance Optimization of Deep Learning Workloads

Steiner, Benoit, Cummins, Chris, He, Horace, Leather, Hugh

arXiv.org Artificial Intelligence

As machine learning techniques become ubiquitous, the efficiency of neural network implementations is becoming correspondingly paramount. Frameworks such as Halide and TVM separate the algorithmic representation of the network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. We model this scheduling problem as a sequence of optimization choices, and present a new technique to accurately predict the expected performance of a partial schedule. By leveraging these predictions we can make these optimization decisions greedily and rapidly identify an efficient schedule. This enables us to find schedules that improve the throughput of deep neural networks by 2.6x over Halide and 1.5x over TVM. Moreover, our technique is two to three orders of magnitude faster than these tools, completing in seconds instead of hours.
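The idea of greedily extending a schedule using a value estimate for partial schedules can be sketched as follows. The candidate optimizations and the toy multiplicative cost model are invented for illustration; the paper learns its performance predictor rather than hard-coding one.

```python
# Greedy schedule search guided by a value estimate of partial schedules.
# Each choice multiplies estimated runtime by a factor (< 1 is a win).
CHOICES = {
    "tile": 0.6,
    "vectorize": 0.5,
    "unroll": 0.9,
    "parallelize": 0.4,
}

def value(partial_schedule, base_runtime=100.0):
    """Predicted runtime of a partial schedule.
    Here a stand-in for a learned value function: apply chosen factors."""
    runtime = base_runtime
    for choice in partial_schedule:
        runtime *= CHOICES[choice]
    return runtime

def greedy_schedule(max_steps=3):
    schedule = []
    for _ in range(max_steps):
        remaining = [c for c in CHOICES if c not in schedule]
        # Extend with the choice whose predicted runtime is lowest.
        best = min(remaining, key=lambda c: value(schedule + [c]))
        if value(schedule + [best]) >= value(schedule):
            break  # no predicted improvement; stop early
        schedule.append(best)
    return schedule, value(schedule)

sched, runtime = greedy_schedule()
print(sched, runtime)  # largest multiplicative wins are picked first
```

Because each step only scores a handful of one-step extensions instead of enumerating complete schedules, the search is fast, which mirrors why the paper's approach completes in seconds rather than hours.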


Deep Learning: What You Need To Know

#artificialintelligence

During the past decade, deep learning has seen groundbreaking developments in the field of AI (Artificial Intelligence). But what is this technology? And why is it so important? Well, let's first get a definition of deep learning. Here's how Kalyan Kumar, who is the Corporate Vice President & Chief Technology Officer of IT Services at HCL Technologies, describes it: "Have you ever wondered how our brain can recognize the face of a friend whom you had met years ago or can recognize the voice of your mother among so many other voices in a crowded marketplace or how our brain can learn, plan and execute complex day-to-day activities? The human brain has around 100 billion cells called neurons. These build massively parallel and distributed networks, through which we learn and carry out complex activities. Inspired from these biological neural networks, scientists started building artificial neural networks so that computers could eventually learn and exhibit intelligence like humans."
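The biological analogy in the quote above can be made concrete with a single artificial neuron: a weighted sum of inputs passed through a nonlinearity. This is a minimal sketch; the weights, inputs, and bias are arbitrary illustrative values.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum + sigmoid 'firing rate'."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-activation))

# Two inputs, e.g. pixel intensities; output near 1 means the neuron "fires".
out = neuron([0.5, 0.8], [1.2, -0.4], bias=0.1)
print(round(out, 3))
```

Deep learning stacks many layers of such units and adjusts the weights from data, which is how networks come to recognize faces or voices as the quote describes.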


Managing GPU workloads with Univa Grid Engine - Univa Corporation

#artificialintelligence

For almost two decades, GPUs (Graphics Processing Units) have been steadily revolutionizing high-performance computing (HPC) and AI. Originally designed for graphics-intensive applications such as gaming and image processing, it didn't take long for HPC professionals to see the potential of low-cost, massively parallel processors able to handle then billions (and now trillions) of floating-point operations per second. In this two-part article, I'll discuss GPU workloads and how they are managed with Univa Grid Engine. First, I'll provide a short primer on GPUs, explain how they are used in HPC and AI, and cover some of the specific challenges when running GPU applications on shared clusters. In part II, I'll focus on some of the specific innovations in Univa Grid Engine that help make GPU applications much easier to deploy and manage at scale.


Fueling AI innovation with a new breed of accelerated computing

#artificialintelligence

The new HPE Apollo 6500 Gen10, announced today, is a groundbreaking server designed to tackle the most compute-intensive high performance computing (HPC) and deep learning workloads. With superior speed, density, and performance, HPE is reinventing what it means to compute. A major transformation is happening now, as technological advancements and escalating volumes of diverse data drive change across all industries. Cutting-edge innovations are fueling digital transformation on a global scale, and organizations are leveraging faster, more powerful machines to operate more intelligently and effectively than ever.


New White Paper: High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning - VMware VROOM! Blog

#artificialintelligence

A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes. Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors. The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplifying the coordination between developer and cluster administrator, and enhanced security.