AITopics | Xiong, Yongqiang

Xiong, Yongqiang

Tutel: Adaptive Mixture-of-Experts at Scale

Hwang, Changho, Cui, Wei, Xiong, Yifan, Yang, Ziyue, Liu, Ze, Hu, Han, Wang, Zilong, Salas, Rafael, Jose, Jithin, Ram, Prabhat, Chau, Joe, Cheng, Peng, Yang, Fan, Yang, Mao, Xiong, Yongqiang

arXiv.org Artificial Intelligence, Jun-5-2023

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism, which forwards each input token to the right sub-models, or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer from inefficient computation due to their static execution, namely static parallelism and pipelining, which do not adapt to the dynamic workload. We present Flex, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Flex designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by all possible parallelism or pipelining methods without any mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Flex also implements various MoE acceleration techniques. Aggregating all techniques, Flex delivers large speedups at any scale -- 4.96x and 5.75x speedup of a single MoE layer over 16 and 2,048 A100 GPUs, respectively, over the previous state-of-the-art. Our evaluation shows that Flex efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Flex accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves higher accuracy than its dense counterpart in both pre-training and downstream computer vision tasks such as COCO object detection, indicating the readiness of Flex for end-to-end real-world model training and inference.
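To make the phrase "token routing dynamically determines the amount of expert workload" concrete, here is a minimal sketch of top-1 routing in plain Python. It is not the paper's implementation: the function names (`route_top1`, `expert_load`, `demo`) and the random gate logits are assumptions made for illustration only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(gate_logits):
    """Send each token to its highest-scoring expert.

    gate_logits: one list of per-expert scores for every token.
    Returns a list of (expert_index, gate_probability) pairs.
    """
    assignments = []
    for logits in gate_logits:
        probs = softmax(logits)
        expert = max(range(len(probs)), key=probs.__getitem__)
        assignments.append((expert, probs[expert]))
    return assignments


def expert_load(assignments, num_experts):
    """Count how many tokens each expert received."""
    load = [0] * num_experts
    for expert, _ in assignments:
        load[expert] += 1
    return load


def demo(num_tokens=16, num_experts=4, seed=0):
    """Hypothetical batch: random logits stand in for a learned gate."""
    rng = random.Random(seed)
    logits = [[rng.gauss(0.0, 1.0) for _ in range(num_experts)]
              for _ in range(num_tokens)]
    return expert_load(route_top1(logits), num_experts)
```

Calling `demo()` with different seeds typically gives a different, uneven per-expert token count each time. That per-batch variation is the dynamic workload that a fixed parallelism or pipelining plan cannot anticipate.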

all-to-all, artificial intelligence, machine learning

arXiv.org Artificial Intelligence

2206.03382

Country: North America > United States (0.14)

Genre: Research Report (0.64)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

An Adaptive Deep RL Method for Non-Stationary Environments with Piecewise Stable Context

Chen, Xiaoyu, Zhu, Xiangming, Zheng, Yufeng, Zhang, Pushi, Zhao, Li, Cheng, Wenxue, Cheng, Peng, Xiong, Yongqiang, Qin, Tao, Chen, Jianyu, Liu, Tie-Yan

arXiv.org Artificial Intelligence, Dec-24-2022

One of the key challenges in deploying RL to real-world applications is to adapt to variations of unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing works on adaptation to unknown environment contexts either assume the contexts are the same for the whole episode or assume the context variables are Markovian. However, in many real-world applications, the environment context usually stays stable for a stochastic period and then changes in an abrupt and unpredictable manner within an episode, resulting in a segment structure, which existing works fail to address. To leverage the segment structure of piecewise stable context in real-world applications, in this paper, we propose a Segmented Context Belief Augmented Deep (SeCBAD) RL method. Our method can jointly infer the belief distribution over latent context with the posterior over segment length and perform more accurate belief context inference with observed data within the current context segment. The inferred belief context can be leveraged to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD can infer context segment length accurately and outperform existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.
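The idea of a joint belief over latent context and segment length can be illustrated with a small Bayesian filter. The sketch below is not SeCBAD: it swaps in a simplified changepoint-style recursion with a constant change probability (`hazard`), a finite set of contexts, and Gaussian observations with known means. All of these, along with the function names, are assumptions made for illustration.

```python
import math


def gaussian(x, mean, sigma=1.0):
    """Gaussian density, used here as the assumed observation model."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def update_belief(belief, x, means, prior, hazard):
    """One filtering step over (segment_length, context) pairs.

    With probability 1 - hazard the current segment continues, so its
    length grows by one and the context is unchanged. With probability
    hazard a new segment starts and its context is drawn from the prior.
    """
    new = {}
    total_prev = sum(belief.values())
    for (length, ctx), p in belief.items():
        new[(length + 1, ctx)] = p * (1.0 - hazard) * gaussian(x, means[ctx])
    for ctx, pc in prior.items():
        key = (1, ctx)
        restart = total_prev * hazard * pc * gaussian(x, means[ctx])
        new[key] = new.get(key, 0.0) + restart
    norm = sum(new.values())
    return {k: v / norm for k, v in new.items()}


def marginal(belief, index):
    """Marginalize the joint belief: index 0 = length, 1 = context."""
    out = {}
    for key, p in belief.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


def run(observations, means, hazard=0.1):
    """Filter a sequence, starting from a uniform prior over contexts."""
    prior = {ctx: 1.0 / len(means) for ctx in means}
    belief = {(0, ctx): p for ctx, p in prior.items()}
    for x in observations:
        belief = update_belief(belief, x, means, prior, hazard)
    return belief
```

For the hypothetical sequence `[0.1, -0.2, 0.0, 5.1, 4.9]` with context means `{"low": 0.0, "high": 5.0}`, the context marginal concentrates on `"high"` and the most probable segment length is 2, because only the last two observations are used to infer the current context.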

artificial intelligence, machine learning, reinforcement learning

arXiv.org Artificial Intelligence

2212.12735

Country:

Asia (0.28)
North America > United States > Massachusetts (0.14)
North America > United States > California (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover Manner

Luo, Cheng, Qu, Lei, Miao, Youshan, Cheng, Peng, Xiong, Yongqiang

arXiv.org Artificial Intelligence, Mar-14-2021

Distributed deep learning workloads include throughput-intensive training tasks on GPU clusters, where distributed Stochastic Gradient Descent (SGD) incurs significant communication delays after backward propagation, forcing workers to wait for gradient synchronization via a centralized parameter server or directly among decentralized workers. We present CrossoverScheduler, an algorithm that enables communication cycles of a distributed training application to be filled by other applications through pipelining communication and computation. With CrossoverScheduler, the running performance of distributed training can be significantly improved without sacrificing convergence rate and network accuracy. We achieve this by introducing Crossover Synchronization, which allows multiple distributed deep learning applications to time-share the same GPU alternately. The prototype of CrossoverScheduler is built and integrated with Horovod. Experiments on a variety of distributed tasks show that CrossoverScheduler achieves a 20% speedup for image classification tasks on the ImageNet dataset.
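The benefit of letting one application compute while another synchronizes gradients can be seen in a toy timeline model. This sketch is not the CrossoverScheduler algorithm. It assumes a single GPU that runs one job's compute phase at a time, that the jobs' communication phases do not contend with each other, and it uses a hypothetical `(iterations, compute_time, comm_time)` job description.

```python
def makespan_serial(jobs):
    """Run the jobs one after another; the GPU idles during each sync."""
    return sum(iters * (compute + comm) for iters, compute, comm in jobs)


def makespan_crossover(jobs):
    """Alternate jobs on the GPU so one computes while another syncs.

    jobs: list of (iterations, compute_time, comm_time) tuples.
    Assumption: communication of different jobs never contends.
    """
    remaining = [iters for iters, _, _ in jobs]
    ready = [0.0] * len(jobs)   # time each job's previous sync completes
    gpu_free = 0.0              # time the GPU finishes its current work
    finish = 0.0
    while any(remaining):
        # Pick the unfinished job whose gradients were synchronized first.
        i = min((j for j in range(len(jobs)) if remaining[j] > 0),
                key=lambda j: ready[j])
        _, compute, comm = jobs[i]
        start = max(gpu_free, ready[i])
        gpu_free = start + compute      # GPU busy with job i's compute
        ready[i] = gpu_free + comm      # job i then synchronizes off-GPU
        remaining[i] -= 1
        finish = max(finish, ready[i])
    return finish
```

With two hypothetical jobs of three iterations each, where compute and communication both take one time unit (`[(3, 1.0, 1.0), (3, 1.0, 1.0)]`), the serial schedule takes 12 units and the alternating schedule takes 7. These figures describe only this toy model and say nothing about the speedups measured in the paper.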

crossoverscheduler, deep learning, neural network

arXiv.org Artificial Intelligence

2103.07974

Country: North America > United States > California (0.14)

Genre: Research Report (0.50)

Industry: Education (0.72)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Stanza: Layer Separation for Distributed Training in Deep Learning

Wu, Xiaorui, Xu, Hong, Li, Bo, Xiong, Yongqiang

arXiv.org Machine Learning, Jan-10-2019

The parameter server architecture is prevalently used for distributed deep learning. Each worker machine in a parameter server system trains the complete model, which leads to a hefty amount of network data transfer between workers and servers. We empirically observe that the data transfer has a non-negligible impact on training time. To tackle the problem, we design a new distributed training system called Stanza. Stanza exploits the fact that in many models such as convolutional neural networks, most data exchange is attributed to the fully connected layers, while most computation is carried out in convolutional layers. Thus, we propose layer separation in distributed training: the majority of the nodes just train the convolutional layers, and the rest train the fully connected layers only. Gradients and parameters of the fully connected layers no longer need to be exchanged across the cluster, thereby substantially reducing the data transfer volume. We implement Stanza on PyTorch and evaluate its performance on Azure and EC2. Results show that Stanza accelerates training significantly over current parameter server systems: on EC2 instances with Tesla V100 GPUs and 10Gb bandwidth, for example, Stanza is 1.34x--13.9x faster for common deep learning models.
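The observation that fully connected layers account for most of the exchanged data can be checked with simple parameter arithmetic. The sketch below counts weights and biases for a VGG-16-style layer configuration. Choosing VGG-16 as the example is an assumption made here, since the abstract does not name a specific model, and the helper names are hypothetical.

```python
# (in_channels, out_channels) of each 3x3 convolution, VGG-16 style.
CONV_LAYERS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# (in_features, out_features) of each fully connected layer.
FC_LAYERS = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]


def conv_params(layers, kernel=3):
    """Weights plus biases of square-kernel convolution layers."""
    return sum(cin * cout * kernel * kernel + cout for cin, cout in layers)


def fc_params(layers):
    """Weights plus biases of fully connected layers."""
    return sum(fin * fout + fout for fin, fout in layers)


def fc_fraction():
    """Share of all parameters that sit in the fully connected layers."""
    conv = conv_params(CONV_LAYERS)
    fc = fc_params(FC_LAYERS)
    return fc / (conv + fc)
```

For this configuration the fully connected layers hold roughly 89% of the parameters. In a parameter server setup that synchronizes every parameter, they would therefore dominate the transfer volume, which is the traffic that layer separation keeps off the network.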

conv worker, deep learning, neural network

arXiv.org Machine Learning

1812.10624

Genre: Research Report (0.70)

Industry: Information Technology (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)