
Collaborating Authors

 Zheng, Shuai


A Pioneering Neural Network Method for Efficient and Robust Fluid Simulation

arXiv.org Artificial Intelligence

Fluid simulation is an important research topic in computer graphics (CG) and video game animation. Traditional methods based on the Navier-Stokes equations are computationally expensive. In this paper, we treat fluid motion as point cloud transformation and propose the first neural network method specifically designed for efficient and robust fluid simulation in complex environments; it is also the first deep learning model capable of stably modeling fluid particle dynamics in such complex scenarios. Our triangle feature fusion design achieves an optimal balance among fluid dynamics modeling, momentum conservation constraints, and global stability control. We conducted comprehensive experiments on multiple datasets. Compared to existing neural network-based fluid simulation algorithms, our method significantly improves accuracy while maintaining high computational speed. Compared to traditional SPH methods, it is approximately 10 times faster, and compared to traditional fluid simulation software such as Flow3D, it is more than 300 times faster.
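
To make the point-cloud view of fluid simulation concrete, here is a minimal sketch (not the paper's architecture): a small MLP predicts per-particle accelerations from positions and velocities, and a zero-mean correction acts as a crude momentum-conservation constraint. The layer sizes, time step, and particle count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointCloudFluidStep(nn.Module):
    """One illustrative simulation step over a particle point cloud."""
    def __init__(self, hidden=128):
        super().__init__()
        # Input per particle: position (3) + velocity (3); output: acceleration (3).
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, pos, vel, dt=0.01):
        acc = self.mlp(torch.cat([pos, vel], dim=-1))
        acc = acc - acc.mean(dim=0, keepdim=True)  # zero net force on the fluid
        new_vel = vel + dt * acc
        new_pos = pos + dt * new_vel
        return new_pos, new_vel

pos = torch.randn(1024, 3)   # 1024 particles
vel = torch.zeros(1024, 3)
pos, vel = PointCloudFluidStep()(pos, vel)
```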


FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction

arXiv.org Artificial Intelligence

Multimodal electronic health record (EHR) data can offer a holistic assessment of a patient's health status, supporting various predictive healthcare tasks. Recently, several studies have embraced the multitask learning approach in the healthcare domain, exploiting the inherent correlations among clinical tasks to predict multiple outcomes simultaneously. However, existing methods require samples to possess complete labels for all tasks, which places heavy demands on the data and restricts the flexibility of the model. Meanwhile, within a multitask framework with multimodal inputs, how to comprehensively account for the information disparity among modalities and among tasks remains a challenging problem. To tackle these issues, we propose a unified healthcare prediction model, named FlexCare, that flexibly accommodates incomplete multimodal inputs and promotes adaptation to multiple healthcare tasks. The proposed model breaks the conventional paradigm of parallel multitask prediction by decomposing it into a series of asynchronous single-task predictions. Specifically, a task-agnostic multimodal information extraction module is presented to capture decorrelated representations of diverse intra- and inter-modality patterns. Taking full account of the information disparities between different modalities and different tasks, we present a task-guided hierarchical multimodal fusion module that integrates the refined modality-level representations into an individual patient-level representation. Experimental results on multiple tasks from the MIMIC-IV/MIMIC-CXR/MIMIC-NOTE datasets demonstrate the effectiveness of the proposed method. Additionally, further analysis underscores the feasibility and potential of employing such a multitask strategy in the healthcare domain. The source code is available at https://github.com/mhxu1998/FlexCare.
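
As an illustration of asynchronous single-task prediction with incomplete labels, the sketch below (a simplification, not FlexCare's architecture) updates only the head of the task whose label is present, so a sample need not be labeled for every task. The encoder, task names, and layer sizes are assumptions for the example.

```python
import torch
import torch.nn as nn

# Shared patient-level encoder and one head per task (sizes are illustrative).
encoder = nn.Linear(64, 32)
heads = nn.ModuleDict({t: nn.Linear(32, 1) for t in ["mortality", "readmission"]})
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(task, features, label):
    """One asynchronous single-task step: only `task`'s head is involved."""
    logits = heads[task](encoder(features))
    loss = loss_fn(logits, label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# A sample labeled only for mortality still contributes a full update.
train_step("mortality", torch.randn(8, 64), torch.randint(0, 2, (8, 1)).float())
```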


Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping

arXiv.org Artificial Intelligence

The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it faces the challenge of extended all-to-all communication latency during the training process. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. Yet, these methods frequently fall short of achieving sufficient overlap, consequently restricting the potential for performance enhancement. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation shows that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times compared to state-of-the-art solutions.
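
The general idea of hiding all-to-all latency behind independent computation can be sketched with an asynchronous collective, as below; this illustrates the overlap principle that Lancet automates at the whole-graph level through its compiler, and is not Lancet itself. It assumes a torch.distributed process group has already been initialized (e.g., via torchrun), and the buffers and layer are placeholders.

```python
import torch
import torch.distributed as dist

def overlapped_forward(dispatch_buf, recv_buf, non_moe_input, non_moe_layer):
    # Launch the token all-to-all without blocking.
    work = dist.all_to_all_single(recv_buf, dispatch_buf, async_op=True)
    # Run non-MoE computation that does not depend on the exchanged tokens.
    hidden = non_moe_layer(non_moe_input)
    # Synchronize only when the expert computation actually needs recv_buf.
    work.wait()
    return hidden, recv_buf
```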


SiGNN: A Spike-induced Graph Neural Network for Dynamic Graph Representation Learning

arXiv.org Artificial Intelligence

In the domain of dynamic graph representation learning (DGRL), efficiently and comprehensively capturing the temporal evolution of real-world networks is crucial. Spiking Neural Networks (SNNs), known for their temporal dynamics and low-power characteristics, offer an efficient solution for temporal processing in DGRL tasks. However, owing to the spike-based information encoding mechanism of SNNs, existing DGRL methods employing SNNs face limitations in their representational capacity. Given this issue, we propose a novel framework named Spike-induced Graph Neural Network (SiGNN) for learning enhanced spatial-temporal representations on dynamic graphs. In detail, a harmonious integration of SNNs and GNNs is achieved through an innovative Temporal Activation (TA) mechanism. Benefiting from the TA mechanism, SiGNN not only effectively exploits the temporal dynamics of SNNs but also adeptly circumvents the representational constraints imposed by the binary nature of spikes. Furthermore, leveraging the inherent adaptability of SNNs, we conduct an in-depth analysis of the evolutionary patterns within dynamic graphs across multiple time granularities. This approach facilitates the acquisition of multiscale temporal node representations. Extensive experiments on various real-world dynamic graph datasets demonstrate the superior performance of SiGNN on the node classification task.
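
The sketch below conveys the flavor of coupling spiking dynamics with GNN node features over time; it is an illustrative leaky integrate-and-fire style gate, not SiGNN's actual Temporal Activation mechanism. The decay and threshold values are assumed, and the soft gating by membrane potential stands in for one way to avoid purely binary spike representations.

```python
import torch

def temporal_gate(node_feats_over_time, decay=0.9, threshold=1.0):
    """node_feats_over_time: tensor of shape (T, N, D) -- one graph snapshot per time step."""
    membrane = torch.zeros_like(node_feats_over_time[0])
    gated = []
    for x_t in node_feats_over_time:
        membrane = decay * membrane + x_t              # integrate
        spike = (membrane >= threshold).float()        # fire
        gated.append(x_t * torch.sigmoid(membrane))    # soft, spike-informed gating
        membrane = membrane * (1.0 - spike)            # reset units that fired
    return torch.stack(gated)

out = temporal_gate(torch.randn(5, 100, 16))  # 5 snapshots, 100 nodes, 16 features
```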


Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training

arXiv.org Artificial Intelligence

Recent years have seen an increase in the development of large deep learning (DL) models, which makes training efficiency crucial. Common practice struggles with the trade-off between usability and performance. On one hand, DL frameworks such as PyTorch use dynamic graphs to facilitate model development, at the price of sub-optimal model training performance. On the other hand, practitioners propose various approaches to improving training efficiency by sacrificing some flexibility, ranging from making the graph static for more thorough optimization (e.g., XLA) to customizing optimization for large-scale distributed training (e.g., DeepSpeed and Megatron-LM). In this paper, we aim to address the tension between usability and training efficiency through separation of concerns. Inspired by DL compilers that decouple the platform-specific optimizations of a tensor-level operator from its arithmetic definition, this paper proposes a schedule language, Slapo, to decouple model execution from definition. Specifically, Slapo works on a PyTorch model and uses a set of schedule primitives to convert the model for common model training optimizations such as high-performance kernels, effective 3D parallelism, and efficient activation checkpointing. Compared to existing optimization solutions, Slapo progressively optimizes the model "as-needed" through high-level primitives, thus preserving programmability and debuggability for users to a large extent. Our evaluation results show that by scheduling existing hand-crafted optimizations in a systematic way using Slapo, we are able to improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs, and by up to 1.41x on multiple machines with up to 64 GPUs, compared to the out-of-the-box performance of DeepSpeed and Megatron-LM.
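
To illustrate the "separation of concerns" idea of optimizing a model from outside its definition, the sketch below attaches activation checkpointing to selected submodules of a plain PyTorch model after construction; it uses only standard PyTorch utilities and is not Slapo's schedule-primitive API.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A model defined with no optimization concerns at all.
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(4)])

def apply_activation_checkpointing(module):
    """Wrap a submodule's forward so its activations are recomputed in the backward pass."""
    original_forward = module.forward
    module.forward = lambda x: checkpoint(original_forward, x, use_reentrant=False)

# "Schedule" only the first two blocks, leaving the model's definition untouched.
for block in model[:2]:
    apply_activation_checkpointing(block)

out = model(torch.randn(8, 256, requires_grad=True))
out.sum().backward()
```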


Contractive error feedback for gradient compression

arXiv.org Artificial Intelligence

On-device memory concerns in distributed deep learning have become severe due to (i) the growth of model size in multi-GPU training, and (ii) the wide adoption of deep neural networks for federated learning on IoT devices with limited storage. In such settings, communication-efficient optimization methods are attractive alternatives; however, they still struggle with memory issues. To tackle these challenges, we propose a communication-efficient method called contractive error feedback (ConEF). As opposed to SGD with error feedback (EFSGD), which manages memory inefficiently, ConEF obtains the sweet spot of convergence and memory usage, and achieves communication efficiency by leveraging biased and all-reducible gradient compression. We empirically validate ConEF on various learning tasks, including image classification, language modeling, and machine translation, and observe that ConEF saves 80%-90% of the extra memory in EFSGD with almost no loss in test performance, while also achieving a 1.3x-5x speedup over SGD. Through our work, we also demonstrate the feasibility and convergence of ConEF, clearing the theoretical barrier to integrating ConEF into popular memory-efficient frameworks such as ZeRO-3.
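
For context, the sketch below shows the generic error-feedback mechanism with top-k gradient compression that ConEF builds on; ConEF's key addition, compressing the error buffer itself, is omitted here. The learning rate and k are illustrative.

```python
import torch

def topk_compress(x, k):
    """Keep only the k largest-magnitude entries (the part that would be communicated)."""
    flat = x.flatten()
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(x)

def ef_step(param, grad, error_buf, lr=0.1, k=10):
    corrected = grad + error_buf                  # add back previously dropped signal
    compressed = topk_compress(corrected, k)      # biased, sparse "communicated" gradient
    error_buf.copy_(corrected - compressed)       # remember what was dropped this round
    param.data.add_(compressed, alpha=-lr)        # apply the compressed update

w, e = torch.zeros(100), torch.zeros(100)
ef_step(w, torch.randn(100), e)
```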


DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines

arXiv.org Artificial Intelligence

Multi-task model training has been adopted to enable a single deep neural network model (often a large language model) to handle multiple tasks (e.g., question answering and text summarization). Multi-task training commonly receives input sequences of highly different lengths due to the diverse contexts of different tasks. Padding (to the same sequence length) or packing (short examples into long sequences of the same length) is usually adopted to prepare input samples for model training, but neither is space- or computation-efficient. This paper proposes a dynamic micro-batching approach to tackle sequence length variation and enable efficient multi-task model training. We advocate pipeline-parallel training of the large model with variable-length micro-batches, each of which potentially comprises a different number of samples. We optimize micro-batch construction using a dynamic programming-based approach, and handle micro-batch execution time variation through dynamic pipeline and communication scheduling, enabling highly efficient pipeline training. Extensive evaluation on the FLANv2 dataset demonstrates up to 4.39x higher training throughput when training T5, and 3.25x when training GPT, compared with packing-based baselines. DynaPipe's source code is publicly available at https://github.com/awslabs/optimizing-multitask-training-through-dynamic-pipelines.
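
A toy version of dynamic-programming micro-batch construction is sketched below: contiguous groups of sorted sequence lengths are chosen to minimize padded tokens plus a fixed per-micro-batch overhead. DynaPipe's actual cost model (execution time and memory under pipeline constraints) is richer; the overhead constant here is an assumption.

```python
def plan_microbatches(lengths, overhead=64):
    """Split sorted sequence lengths into contiguous micro-batches minimizing padded tokens."""
    lengths = sorted(lengths)
    n = len(lengths)
    best = [0.0] + [float("inf")] * n      # best[i]: minimal cost of the first i sequences
    split = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):                 # last micro-batch covers sequences j .. i-1
            cost = best[j] + (i - j) * lengths[i - 1] + overhead
            if cost < best[i]:
                best[i], split[i] = cost, j
    groups, i = [], n                      # recover the partition by walking back
    while i > 0:
        groups.append(lengths[split[i]:i])
        i = split[i]
    return groups[::-1]

print(plan_microbatches([12, 15, 16, 90, 100, 512, 520]))
```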


Node-oriented Spectral Filtering for Graph Neural Networks

arXiv.org Artificial Intelligence

Graph neural networks (GNNs) have shown remarkable performance on homophilic graph data while being far less impressive on non-homophilic graph data, due to the inherent low-pass filtering property of GNNs. In general, since real-world graphs are often complex mixtures of diverse subgraph patterns, learning a universal spectral filter on the graph from a global perspective, as in most current works, may still struggle to adapt to variations in local patterns. On the basis of a theoretical analysis of local patterns, we rethink existing spectral filtering methods and propose node-oriented spectral filtering for graph neural networks (NFGNN). By estimating a node-oriented spectral filter for each node, NFGNN gains the capability of precise local node positioning via the generalized translated operator, thus adaptively discriminating variations in local homophily patterns. Meanwhile, the use of re-parameterization brings a good trade-off between global consistency and local sensitivity when learning the node-oriented spectral filters. Furthermore, we theoretically analyze the localization property of NFGNN, demonstrating that the signal after adaptive filtering is still positioned around the corresponding node. Extensive experimental results demonstrate that the proposed NFGNN achieves more favorable performance.
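
The core idea of giving every node its own spectral filter can be sketched as a node-wise polynomial filter over powers of a normalized adjacency, as below; this omits NFGNN's generalized translated operator and re-parameterization, and all shapes and the normalization are illustrative.

```python
import torch
import torch.nn as nn

class NodeWiseFilter(nn.Module):
    """Polynomial spectral filter with a separate coefficient vector per node."""
    def __init__(self, num_nodes, order=3):
        super().__init__()
        self.theta = nn.Parameter(torch.randn(num_nodes, order + 1) * 0.1)

    def forward(self, adj_norm, x):
        out = self.theta[:, 0:1] * x                   # order-0 term
        h = x
        for k in range(1, self.theta.shape[1]):
            h = adj_norm @ h                           # propagate one more hop
            out = out + self.theta[:, k:k + 1] * h     # node-specific weighting of hop k
        return out

n, d = 50, 16
adj = torch.rand(n, n)
adj = (adj + adj.t()) / 2                              # symmetric toy graph
y = NodeWiseFilter(n)(adj / adj.sum(1, keepdim=True), torch.randn(n, d))
```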


Unleashing the potential of GNNs via Bi-directional Knowledge Transfer

arXiv.org Artificial Intelligence

Based on the message-passing paradigm, a substantial body of research has proposed diverse and impressive feature propagation mechanisms to improve the performance of GNNs. However, less attention has been paid to feature transformation, the other major operation of the message-passing framework. In this paper, we first empirically investigate the performance of the feature transformation operation in several typical GNNs. Unexpectedly, we notice that GNNs do not fully exploit the power of the inherent feature transformation operation. Motivated by this observation, we propose Bi-directional Knowledge Transfer (BiKT), a plug-and-play approach to unleash the potential of the feature transformation operations without modifying the original architecture. Taking the feature transformation operation as a derived representation learning model that shares parameters with the original GNN, the direct prediction of this model provides topology-agnostic knowledge feedback that can further guide the learning of the GNN and the feature transformations therein. On this basis, BiKT not only allows us to acquire knowledge from both the GNN and its derived model, but also lets the two promote each other by injecting knowledge from one into the other. In addition, a theoretical analysis demonstrates that BiKT improves the generalization bound of GNNs from the perspective of domain adaptation. An extensive set of experiments on up to 7 datasets with 5 typical GNNs demonstrates that BiKT brings a 0.5%-4% performance gain over the original GNN, yielding a boosted GNN. Meanwhile, the derived model also shows powerful performance, competing with or even surpassing the original GNN, enabling us to flexibly apply it independently to other downstream tasks.
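
Bi-directional knowledge transfer between the GNN and its derived model can be illustrated with a mutual-distillation loss, sketched below; the temperature, loss weights, and the use of plain KL terms are assumptions for the example rather than BiKT's exact objective.

```python
import torch
import torch.nn.functional as F

def mutual_distillation_loss(gnn_logits, derived_logits, labels, alpha=0.5, T=2.0):
    """Supervised losses plus KL terms in both directions between the two models."""
    ce = F.cross_entropy(gnn_logits, labels) + F.cross_entropy(derived_logits, labels)
    # The GNN learns from the (detached) derived model ...
    kl_g = F.kl_div(F.log_softmax(gnn_logits / T, dim=-1),
                    F.softmax(derived_logits.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    # ... and the derived model learns from the (detached) GNN.
    kl_d = F.kl_div(F.log_softmax(derived_logits / T, dim=-1),
                    F.softmax(gnn_logits.detach() / T, dim=-1),
                    reduction="batchmean") * T * T
    return ce + alpha * (kl_g + kl_d)

loss = mutual_distillation_loss(torch.randn(32, 7, requires_grad=True),
                                torch.randn(32, 7, requires_grad=True),
                                torch.randint(0, 7, (32,)))
```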


Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

arXiv.org Artificial Intelligence

This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning over twenty thousand classes. Once pre-trained, the prompt, with its strong transferability, can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% compared to CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 compared to ZSSeg). Our code is available at https://github.com/amazon-science/prompt-pretraining.
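
A minimal sketch of the prompt-sharing idea follows: one learned prompt sequence is prepended to every class-name embedding and the resulting text features are scored against image features, CLIP-style. The `text_encoder` and token embeddings are hypothetical stand-ins, not POMP's pre-training pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedClassifier(nn.Module):
    def __init__(self, text_encoder, prompt_len=16, dim=512):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)  # shared across classes
        self.text_encoder = text_encoder

    def class_weights(self, class_name_embeddings):
        # class_name_embeddings: (num_classes, name_len, dim) token embeddings.
        p = self.prompt.unsqueeze(0).expand(class_name_embeddings.size(0), -1, -1)
        return self.text_encoder(torch.cat([p, class_name_embeddings], dim=1))

    def forward(self, image_features, class_name_embeddings):
        w = F.normalize(self.class_weights(class_name_embeddings), dim=-1)  # (num_classes, dim)
        x = F.normalize(image_features, dim=-1)                             # (batch, dim)
        return x @ w.t()                                                    # cosine-similarity logits

# Toy text encoder (mean pooling) standing in for a real transformer text tower.
toy_text_encoder = lambda tokens: tokens.mean(dim=1)
clf = PromptedClassifier(toy_text_encoder)
logits = clf(torch.randn(4, 512), torch.randn(20, 8, 512))  # 4 images, 20 classes
```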