
Collaborating Authors

 Zhang, Shunkang


ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference

arXiv.org Artificial Intelligence

Sparse Mixture of Experts (MoE) models, while outperforming dense Large Language Models (LLMs) in terms of performance, face significant deployment challenges during inference due to their high memory demands. Existing offloading techniques, which involve swapping activated and idle experts between the GPU and CPU, often suffer from rigid expert caching mechanisms. These mechanisms fail to adapt to dynamic routing, leading to inefficient cache utilization, or incur prohibitive costs for prediction training. To tackle these inference-specific challenges, we introduce ExpertFlow, a comprehensive system specifically designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU. This reduces overhead and boosts system performance. Central to our approach is a predictive routing path-based offloading mechanism that utilizes a lightweight predictor to accurately forecast routing paths before computation begins. This proactive strategy allows for real-time error correction in expert caching, significantly increasing cache hit ratios and reducing the frequency of expert transfers, thereby minimizing I/O overhead. Additionally, we implement a dynamic token scheduling strategy that optimizes MoE inference by rearranging input tokens across different batches. This method not only reduces the number of activated experts per batch but also improves computational efficiency. Our extensive experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods, highlighting its effectiveness and utility as a robust solution for resource-constrained inference scenarios.
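As a rough illustration of the caching idea described above (a minimal sketch under assumptions, not ExpertFlow's actual implementation; the class name, the capacity parameter, and the load_expert callback are hypothetical), a predictive expert cache can prefetch the experts a lightweight predictor expects the router to activate and fall back to on-demand loading only when the prediction misses:

    # Hypothetical sketch of a predictive expert cache (illustrative only).
    # Predicted experts are prefetched into a fixed-size GPU-resident cache;
    # prediction errors are corrected at compute time with an extra transfer.
    from collections import OrderedDict

    class PredictiveExpertCache:
        def __init__(self, capacity, load_expert):
            self.capacity = capacity          # max experts resident on the GPU
            self.load_expert = load_expert    # callback: expert_id -> weights (CPU -> GPU copy)
            self.cache = OrderedDict()        # expert_id -> weights, kept in LRU order
            self.hits = 0
            self.misses = 0

        def prefetch(self, predicted_ids):
            """Bring predicted experts onto the GPU before computation starts."""
            for eid in predicted_ids:
                self._fetch(eid)

        def get(self, eid):
            """Fetch an expert at compute time, correcting any prediction error."""
            if eid in self.cache:
                self.hits += 1
                self.cache.move_to_end(eid)   # mark as most recently used
                return self.cache[eid]
            self.misses += 1
            return self._fetch(eid)

        def _fetch(self, eid):
            if eid in self.cache:
                self.cache.move_to_end(eid)
                return self.cache[eid]
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least recently used expert
            self.cache[eid] = self.load_expert(eid)
            return self.cache[eid]

    # Toy usage: 8 experts, room for 4 on the "GPU", and a perfect predictor.
    cache = PredictiveExpertCache(capacity=4, load_expert=lambda eid: f"weights[{eid}]")
    routing_path = [1, 3, 5, 1]               # experts the router will actually pick
    cache.prefetch(routing_path)              # proactive load based on the prediction
    outputs = [cache.get(eid) for eid in routing_path]
    print(cache.hits, cache.misses)           # -> 4 0 when the prediction is exact

The point of the prefetch/get split is that a wrong prediction only costs one extra transfer at compute time, while a correct prediction turns every expert access into a cache hit.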


Kimad: Adaptive Gradient Compression with Bandwidth Awareness

arXiv.org Artificial Intelligence

In distributed training, communication often emerges as a bottleneck. In response, we introduce Kimad, a solution that offers adaptive gradient compression. By consistently monitoring bandwidth, Kimad refines compression ratios to match specific neural network layer requirements. Our exhaustive tests and proofs confirm Kimad's outstanding performance, establishing it as a benchmark in adaptive compression for distributed deep learning.
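To make the bandwidth-aware idea concrete (an illustrative sketch only, not Kimad's actual rule; choose_ratio, the per-step time budget, and the top-k compressor are assumptions of this example), one simple policy picks a per-layer keep ratio so that the compressed gradient roughly fits the byte budget implied by the currently measured bandwidth:

    # Hypothetical sketch of bandwidth-aware top-k gradient compression (illustrative only).
    import numpy as np

    def choose_ratio(layer_size, bandwidth_bytes_per_s, step_budget_s, floor=0.01):
        """Fraction of entries to keep so the layer transfers within the time budget."""
        budget_bytes = bandwidth_bytes_per_s * step_budget_s
        ratio = budget_bytes / (layer_size * 4)       # 4 bytes per float32 value
        return float(np.clip(ratio, floor, 1.0))

    def topk_compress(grad, ratio):
        """Keep the largest-magnitude entries; return their indices and values."""
        k = max(1, int(grad.size * ratio))
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        return idx, grad[idx]

    def decompress(idx, vals, size):
        out = np.zeros(size, dtype=vals.dtype)
        out[idx] = vals
        return out

    # Toy usage: the same layer is compressed more aggressively as bandwidth drops.
    grad = np.random.randn(1_000_000).astype(np.float32)
    for bw in (1e9, 1e8, 1e7):                        # bytes/s, e.g. from a bandwidth monitor
        r = choose_ratio(grad.size, bw, step_budget_s=0.01)
        idx, vals = topk_compress(grad, r)
        restored = decompress(idx, vals, grad.size)
        err = np.linalg.norm(grad - restored) / np.linalg.norm(grad)
        print(f"bandwidth={bw:.0e} B/s  keep ratio={r:.3f}  sent={vals.size}  rel. error={err:.3f}")

When measured bandwidth drops, the keep ratio shrinks and fewer values are sent; when bandwidth recovers, compression relaxes back toward transmitting the full gradient.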


Wasserstein-Wasserstein Auto-Encoders

arXiv.org Machine Learning

To address the challenges in learning deep generative models (e.g., the blurriness of variational auto-encoders and the instability of training generative adversarial networks), we propose a novel deep generative model, named Wasserstein-Wasserstein auto-encoders (WWAE). We formulate WWAE as the minimization of the penalized optimal transport between the target distribution and the generated distribution. By noticing that both the prior $P_Z$ and the aggregated posterior $Q_Z$ of the latent code $Z$ can be well captured by Gaussians, the proposed WWAE utilizes the closed form of the squared Wasserstein-2 distance between two Gaussians in the optimization process. As a result, WWAE does not suffer from the sampling burden and is computationally efficient by leveraging the reparameterization trick. Numerical results on multiple benchmark datasets, including MNIST, Fashion-MNIST and CelebA, show that WWAE learns better latent structures than VAEs and generates samples of better visual quality, with lower FID scores than VAEs and GANs.
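For reference, the closed form invoked above is the standard squared Wasserstein-2 distance between two Gaussians,

$$ W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\big) = \|m_1-m_2\|_2^2 + \operatorname{Tr}\!\Big(\Sigma_1+\Sigma_2-2\big(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\big)^{1/2}\Big). $$

With diagonal covariances, as in the usual Gaussian reparameterization, the trace term reduces to $\sum_i(\sigma_{1,i}-\sigma_{2,i})^2$ over per-coordinate standard deviations, so the objective can be evaluated without sampling or matrix square roots.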


Deep Generative Learning via Variational Gradient Flow

arXiv.org Machine Learning

Learning the generative model, i.e., the underlying data-generating distribution, from large amounts of data is one of the fundamental tasks in machine learning and statistics [46]. Recent advances in deep generative models have provided novel techniques for unsupervised and semi-supervised learning, with broad applications ranging from image synthesis [44], semantic image editing [60], and image-to-image translation [61] to low-level image processing [29]. Implicit deep generative models are a powerful and flexible framework for approximating the target distribution by learning deep samplers [38], with generative adversarial networks (GANs) [16] and likelihood-based models, such as variational auto-encoders (VAEs) [23] and flow-based methods [11], as their main representatives. These models learn a deterministic or stochastic nonlinear mapping that transforms low-dimensional latent samples from a simple reference distribution into samples that closely match the target distribution. GANs set up a minimax two-player game between a generator and a discriminator: during training, the generator transforms samples from the reference distribution into samples intended to deceive the discriminator, while the discriminator performs a differentiable two-sample test to distinguish generated samples from observed samples. The objective of vanilla GANs amounts to the Jensen-Shannon (JS) divergence between the learned distribution and the target distribution. Vanilla GANs generate sharp image samples but suffer from training instability [3]. A myriad of extensions to vanilla GANs have been investigated, both theoretically and empirically, in order to achieve stable training and high-quality sample generation.
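For context, the vanilla GAN criterion referred to above is the minimax value function of [16],

$$ \min_G \max_D \; \mathbb{E}_{x\sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z}\big[\log\big(1-D(G(z))\big)\big], $$

and at the optimal discriminator this value equals $2\,\mathrm{JS}(p_{\mathrm{data}}\,\|\,p_g)-\log 4$, which is why training the generator amounts to minimizing the Jensen-Shannon divergence between the generated distribution $p_g$ and the target distribution.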