Bai, Wei
Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement
Wu, Yongji, Qu, Wenjie, Tao, Tianyang, Wang, Zhuang, Bai, Wei, Li, Zhuohao, Tian, Yuan, Zhang, Jiaheng, Lentz, Matthew, Zhuo, Danyang
The sparsely-activated Mixture-of-Experts (MoE) architecture has increasingly been adopted to further scale large language models (LLMs), thanks to its sub-linear scaling of computation cost. However, frequent failures still pose significant challenges as training scales. The cost of even a single failure is significant: all GPUs must sit idle until the failure is resolved, and considerable training progress may be lost as training restarts from checkpoints. Existing solutions for efficient fault-tolerant training either lack elasticity or build resiliency into pipeline parallelism, which cannot be applied to MoE models due to the expert parallelism strategy adopted by the MoE architecture. We present Lazarus, a system for resilient and elastic training of MoE models. Lazarus adaptively allocates expert replicas to address the inherent imbalance in expert workload and speed up training, while a provably optimal expert placement algorithm maximizes the probability of recovery upon failures. Through adaptive expert placement and a flexible token dispatcher, Lazarus can also fully utilize all available nodes after failures, leaving no GPU idle. Our evaluation shows that Lazarus outperforms existing MoE training systems by up to 5.7x under frequent node failures and 3.4x on a real spot instance trace.
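The core idea of workload-aware replica allocation can be illustrated with a minimal sketch: give heavily loaded experts more replicas so per-replica load stays balanced. This is an illustrative greedy heuristic, not Lazarus's actual placement algorithm or API; all names here are hypothetical.

```python
# Illustrative sketch (not Lazarus's actual algorithm): allocate expert
# replicas across a fixed budget of GPU slots in proportion to observed
# expert workload, so hot experts receive more replicas.

def allocate_replicas(token_counts: list[int], total_slots: int) -> list[int]:
    """Give each expert one replica, then hand the remaining slots,
    one at a time, to the expert whose replicas carry the most tokens."""
    num_experts = len(token_counts)
    assert total_slots >= num_experts, "need at least one slot per expert"
    replicas = [1] * num_experts
    for _ in range(total_slots - num_experts):
        # The expert with the highest load per replica gets the next slot.
        worst = max(range(num_experts),
                    key=lambda e: token_counts[e] / replicas[e])
        replicas[worst] += 1
    return replicas

# Example: 4 experts with skewed token load, 8 GPU slots in total.
print(allocate_replicas([900, 300, 200, 100], 8))  # -> [4, 2, 1, 1]
```

After the greedy pass, the hottest expert serves 900 tokens across 4 replicas (225 each), close to the other experts' per-replica load; the paper's placement algorithm additionally optimizes which nodes host which replicas to maximize recovery probability.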
Discover the Hidden Attack Path in Multi-domain Cyberspace Based on Reinforcement Learning
Zhang, Lei, Bai, Wei, Li, Wei, Xia, Shiming, Zheng, Qibin
In this work, we present a learning-based approach to analyzing cyberspace security configurations. Unlike prior methods, our approach can learn from past experience and improve over time. In particular, as we train over a greater number of agents acting as attackers, our method becomes better at discovering attack paths that remain hidden from previous methods, especially in multi-domain cyberspace. To achieve these results, we pose attack path discovery as a Reinforcement Learning (RL) problem and train an agent to discover multi-domain cyberspace attack paths. To enable the RL policy to discover more hidden attack paths and shorter attack paths, we introduce a multi-domain action selection module into the RL framework. Our objective is to use the proposed method to discover more hidden and shorter attack paths, and thereby analyze the weaknesses of a cyberspace security configuration. Finally, we design a simulated cyberspace experimental environment to verify the proposed method; the experimental results show that our method discovers more hidden multi-domain attack paths and shorter attack paths than existing baseline methods.
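To make the RL formulation concrete, here is a toy sketch: states are hosts in a small multi-domain graph, actions are attacker moves along edges, and tabular Q-learning learns a short path to the target. The graph, rewards, and hyperparameters are hypothetical and stand in for the paper's simulated environment.

```python
# Toy sketch of posing attack-path discovery as RL (hypothetical
# environment, not the paper's simulator). A step cost encourages the
# agent to prefer shorter attack paths.
import random
from collections import defaultdict

# Hypothetical cyberspace: nodes span network and physical domains.
edges = {
    "entry":       ["web", "badge_room"],
    "web":         ["db", "entry"],
    "badge_room":  ["server_room", "entry"],
    "db":          ["target"],
    "server_room": ["target"],
}
TARGET = "target"

Q = defaultdict(float)            # Q[(state, action)] -> value
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(2000):
    s = "entry"
    for _ in range(10):                        # step cap per episode
        acts = edges[s]
        a = random.choice(acts) if random.random() < eps else \
            max(acts, key=lambda x: Q[(s, x)])
        r = 10.0 if a == TARGET else -1.0      # goal reward vs. step cost
        nxt = 0.0 if a == TARGET else max(Q[(a, x)] for x in edges[a])
        Q[(s, a)] += alpha * (r + gamma * nxt - Q[(s, a)])
        if a == TARGET:
            break
        s = a

# Greedy rollout of the learned policy: a shortest attack path.
s, path = "entry", ["entry"]
while s != TARGET:
    s = max(edges[s], key=lambda x: Q[(s, x)])
    path.append(s)
print(path)   # e.g. ['entry', 'web', 'db', 'target']
```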
Quantitative Evaluations on Saliency Methods: An Experimental Study
Li, Xiao-Hui, Shi, Yuhan, Li, Haoyang, Bai, Wei, Song, Yuanwei, Cao, Caleb Chen, Chen, Lei
It has long been argued that eXplainable AI (XAI) is an important topic, yet it lacks rigorous definitions and fair metrics. In this paper, we briefly summarize the status quo of these metrics, along with an exhaustive experimental study based on them, covering faithfulness, localization, false positives, sensitivity checks, and stability. From the experimental results, we conclude that among all the methods we compare, no single explanation method dominates the others on all metrics. Nonetheless, Gradient-weighted Class Activation Mapping (Grad-CAM) and Randomized Input Sampling for Explanation (RISE) perform well on most metrics. Using a set of filtered metrics, we further present a case study that diagnoses the classification bases of models. Beyond providing a comprehensive experimental study of the metrics, we also examine measuring factors that current metrics miss, and we hope this work can serve as a guide for future research.
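One of the evaluated metric families, faithfulness, is commonly operationalized as a deletion test. The sketch below shows that idea under stated assumptions: `model_score` is a hypothetical callable returning the target-class probability, and the occlusion baseline and step count are illustrative choices, not the paper's exact protocol.

```python
# Minimal sketch of a faithfulness ("deletion") metric: occlude pixels
# in order of decreasing saliency and track how quickly the model's
# class score drops. A faithful saliency map drives the score down
# fast, yielding a low area under the curve.
import numpy as np

def deletion_auc(image: np.ndarray, saliency: np.ndarray,
                 model_score, steps: int = 20) -> float:
    """image: HxWxC array; saliency: HxW importance map;
    model_score(image) -> scalar target-class probability."""
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]   # most salient first
    per_step = (h * w) // steps
    img = image.copy()
    scores = [model_score(img)]
    for i in range(steps):
        idx = order[i * per_step:(i + 1) * per_step]
        ys, xs = np.unravel_index(idx, (h, w))
        img[ys, xs, :] = 0.0                     # occlude with a zero baseline
        scores.append(model_score(img))
    return float(np.trapz(scores, dx=1.0 / steps))  # lower = more faithful
```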
Weakness Analysis of Cyberspace Configuration Based on Reinforcement Learning
Zhang, Lei, Bai, Wei, Guo, Shize, Xia, Shiming, Li, Hongmei, Pan, Zhisong
In this work, we present a learning-based approach to analyzing cyberspace configurations. Unlike prior methods, our approach can learn from past experience and improve over time. In particular, as we train over a greater number of agents acting as attackers, our method becomes better at rapidly finding attack paths that were previously hidden, especially in multi-domain cyberspace. To achieve these results, we pose attack path finding as a Reinforcement Learning (RL) problem and train an agent to find multi-domain attack paths. To enable the RL policy to find more hidden attack paths, we introduce a multi-domain action selection module into the RL framework. We design a simulated cyberspace experimental environment to verify our method; our objective is to find more hidden attack paths and thereby analyze the weaknesses of a cyberspace configuration. The experimental results show that our method finds more hidden multi-domain attack paths than existing baseline methods.
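The multi-domain action selection idea can be pictured as domain-aware action masking: the agent may only take actions in domains where it already has a foothold, and some actions grant footholds in new domains. Everything below is a hypothetical illustration; none of these names come from the paper.

```python
# Hypothetical illustration of multi-domain action selection: actions
# are tagged with the domain they operate in, and only actions in
# domains the attacker has already reached are offered to the policy.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Action:
    name: str
    domain: str               # domain this action operates in
    grants: Optional[str]     # domain foothold it grants on success

ACTIONS = [
    Action("scan_subnet",   "network",  None),
    Action("exploit_web",   "network",  "digital"),
    Action("tailgate_door", "physical", "network"),
    Action("read_db",       "digital",  None),
]

def select_actions(footholds: set[str]) -> list[Action]:
    """Mask out actions in domains the attacker cannot reach yet."""
    return [a for a in ACTIONS if a.domain in footholds]

# Starting with only physical access, cross-domain actions
# progressively unlock the other domains.
footholds = {"physical"}
avail = select_actions(footholds)
print([a.name for a in avail])          # ['tailgate_door']
footholds.add(avail[0].grants)          # tailgating grants network access
print([a.name for a in select_actions(footholds)])
# ['scan_subnet', 'exploit_web', 'tailgate_door']
```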