Collaborating Authors

Enabling fairer data clusters for machine learning


Recently published research by CSE investigators makes training machine learning (ML) models fairer and faster. With a tool called AlloX, Prof. Mosharaf Chowdhury and a team from Stony Brook University developed a new way to fairly schedule high volumes of ML jobs in data centers that use multiple types of computing hardware, like CPUs, GPUs, and specialized accelerators. As these so-called heterogeneous clusters become the norm, fair scheduling systems like AlloX will be essential to their efficient operation. This project is a new step for Chowdhury's lab, which has recently published a number of tools aimed at speeding up the training and testing of ML models. Their past projects Tiresias and Salus sped up GPU resource sharing at multiple scales: both within a single GPU (Salus) and across many GPUs in a cluster (Tiresias).
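The core problem AlloX tackles is that a job's runtime differs across device types, so a fair, efficient scheduler must decide both where and when each job runs. The sketch below illustrates the idea with a greedy placement loop over hypothetical jobs and per-device runtimes; it is not AlloX's actual algorithm, which formulates the placement as an optimization over jobs and device time slots.

```python
# Toy illustration of heterogeneity-aware scheduling: each job runs at a
# different speed on each device type, so the scheduler places each job on
# whichever device minimizes its completion time given current queues.
# Jobs, runtimes, and the greedy policy here are hypothetical, not AlloX's.

def greedy_schedule(jobs, devices):
    """jobs: {name: {device: runtime}}; devices: list of device names.
    Returns {name: (device, finish_time)}."""
    queue_end = {d: 0.0 for d in devices}  # when each device next frees up
    placement = {}
    # Shortest-job-first (by best-case runtime) reduces average completion time.
    for name in sorted(jobs, key=lambda j: min(jobs[j].values())):
        # Finish time = device's current backlog + this job's runtime there.
        best = min(devices, key=lambda d: queue_end[d] + jobs[name][d])
        queue_end[best] += jobs[name][best]
        placement[name] = (best, queue_end[best])
    return placement

jobs = {
    "cnn":  {"gpu": 2.0, "cpu": 10.0},  # strongly GPU-friendly
    "etl":  {"gpu": 4.0, "cpu": 5.0},   # nearly device-agnostic
    "tree": {"gpu": 6.0, "cpu": 3.0},   # CPU-friendly
}
print(greedy_schedule(jobs, ["gpu", "cpu"]))
```

Note how "tree" lands on the CPU even though the GPU exists: on a heterogeneous cluster, the best device is job-specific, which is exactly why one-size-fits-all GPU schedulers fall short.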

Twitter round-up: Google's neural machine translation system most popular AI tweet in August 2020


Verdict lists ten of the most popular tweets on artificial intelligence (AI) in August 2020, based on data from GlobalData's Influencer Platform. The platform selects influencers through a structured process built on pre-defined parameters, after a deep analysis of each influencer's relevance, network strength, engagement, and leadership of discussions on new and emerging trends. Ronald van Loon, principal analyst and CEO of Intelligent World, shared a video from the World Economic Forum on a neural machine translation technology developed by Google that provides natural translation between languages using artificial intelligence and deep learning. The system has also been used to translate directly between two languages without using English as a bridge.

Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters

Efficient GPU resource scheduling is essential to maximize resource utilization and reduce training costs for the growing number of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to leverage the performance characteristics of deep learning jobs. However, they can hardly reach optimal efficiency due to their lack of elasticity. To address the problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on its training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that continuously optimizes the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACC's Longhorn supercomputer. The results show that ONES outperforms prior deep learning schedulers with a significantly shorter average job completion time.
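The key mechanism is the online evolutionary search: candidate batch-size assignments are mutated and kept only if they score better under the scheduling objective. A minimal sketch of that loop, assuming a toy fitness function where each job's GPU demand scales with its batch size and the goal is to fill a fixed GPU capacity (ONES's real objective and mutation operators are more involved):

```python
import random

# Minimal hill-climbing sketch in the spirit of ONES's evolutionary search.
# Assumption: each job's GPU demand is (batch size x per-sample GPU cost),
# and fitness is total utilization, with over-capacity plans penalized.

def fitness(batch_sizes, gpu_per_sample, capacity):
    used = sum(b * g for b, g in zip(batch_sizes, gpu_per_sample))
    return used if used <= capacity else -1  # infeasible plans score worst

def evolve(n_jobs, gpu_per_sample, capacity, choices=(0, 8, 16, 32),
           generations=200, seed=0):
    rng = random.Random(seed)
    best = [rng.choice(choices) for _ in range(n_jobs)]
    for _ in range(generations):
        cand = list(best)
        cand[rng.randrange(n_jobs)] = rng.choice(choices)  # mutate one job
        if fitness(cand, gpu_per_sample, capacity) >= \
           fitness(best, gpu_per_sample, capacity):
            best = cand  # keep the candidate if it is at least as good
    return best

plan = evolve(n_jobs=3, gpu_per_sample=[0.5, 1.0, 0.25], capacity=32)
print(plan, fitness(plan, [0.5, 1.0, 0.25], 32))
```

Because the search runs online, it can keep mutating batch sizes as jobs arrive and finish, which is what gives the scheduler its elasticity compared with a static policy.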

Themis: Fair and Efficient GPU Cluster Scheduling


GPU clusters are the mainstream infrastructure for executing distributed machine learning (ML) training workloads. However, when multiple such workloads run on a shared cluster, significant resource contention occurs. The authors of Themis [1] argue that existing cluster scheduling mechanisms are ill-suited to the unique characteristics of ML training workloads: these are usually long-running jobs that must be gang-scheduled, and their performance is sensitive to the relative placement of their tasks. They propose Themis [1], a new scheduling framework for ML training workloads.
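Themis's notion of fairness centers on finish-time fairness, a ratio rho comparing a job's finish time under its current shared allocation against its finish time if it had an exclusive 1/N share of the cluster; jobs with the largest rho are the most unfairly treated and get priority for resources. A hedged sketch of that metric with hypothetical job numbers (the full system additionally runs an auction among the worst-off applications, which this omits):

```python
# Sketch of the finish-time fairness metric rho from Themis:
#   rho = T_shared / T_fair_share
# where T_fair_share assumes an exclusive 1/N slice of the cluster.
# rho > 1 means the job is worse off than under equal sharing.
# All job numbers below are hypothetical.

def finish_time_fairness(remaining_work, current_gpus, total_gpus, n_jobs):
    t_shared = remaining_work / max(current_gpus, 1e-9)   # GPU-hours / GPUs
    t_fair = remaining_work / (total_gpus / n_jobs)       # exclusive 1/N share
    return t_shared / t_fair

jobs = {  # name: (remaining work in GPU-hours, GPUs currently held)
    "bert":   (100.0, 2),
    "resnet": (40.0, 4),
    "gan":    (60.0, 1),
}
total_gpus, n = 8, len(jobs)
rho = {name: finish_time_fairness(w, g, total_gpus, n)
       for name, (w, g) in jobs.items()}
# The scheduler offers freed-up GPUs to the job with the highest rho first.
most_starved = max(rho, key=rho.get)
print(rho, most_starved)
```

Here "resnet" holds more than its fair share (rho < 1) while "gan" is starved (rho > 1), so freed GPUs would be offered to "gan" first.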

Enabling Level-4 Autonomous Driving on a Single $1k Off-the-Shelf Card

Autonomous driving is of great interest in both research and industry. High cost has been one of the major roadblocks slowing the development and adoption of autonomous driving in practice. This paper shows, for the first time, that it is possible to run level-4 (i.e., fully autonomous) driving software on a single off-the-shelf card (Jetson AGX Xavier) costing less than $1k, an order of magnitude less than state-of-the-art systems, while meeting all latency requirements. The success stems from resolving, through a series of measures and innovations, several important issues shared by existing practices. The study overturns common perceptions of the computing resources required by level-4 autonomous driving, points out a promising path for the industry to lower costs, and suggests a number of research opportunities for rethinking the architecture, software design, and optimizations of autonomous driving.