Zeng, Zhichen
CATS: Mitigating Correlation Shift for Multivariate Time Series Classification
Lin, Xiao, Zeng, Zhichen, Wei, Tianxin, Liu, Zhining, Chen, Yuzhong, Tong, Hanghang
Unsupervised Domain Adaptation (UDA) leverages labeled source data to train models for unlabeled target data. Given the prevalence of multivariate time series (MTS) data across various domains, the UDA task for MTS classification has emerged as a critical challenge. However, for MTS data, correlations between variables often vary across domains, whereas most existing UDA works for MTS classification have overlooked this essential characteristic. To bridge this gap, we introduce a novel domain shift, {\em correlation shift}, measuring domain differences in multivariate correlation. To mitigate correlation shift, we propose a scalable and parameter-efficient \underline{C}orrelation \underline{A}dapter for M\underline{TS} (CATS). Designed as a plug-and-play technique compatible with various Transformer variants, CATS employs temporal convolution to capture local temporal patterns and a graph attention module to model the changing multivariate correlation. The adapter reweights the target correlations to align with the source correlations at a theoretically guaranteed precision. A correlation alignment loss is further proposed to mitigate correlation shift, bypassing the alignment challenge posed by the non-i.i.d. nature of MTS data. Extensive experiments on four real-world datasets demonstrate that (1) compared with vanilla Transformer-based models, CATS improves average accuracy by over $10\%$ while adding only around $1\%$ more parameters, and (2) all Transformer variants equipped with CATS either reach or surpass state-of-the-art baselines.
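To make the idea concrete, below is a minimal, hypothetical sketch of a CATS-style adapter in PyTorch. The module name, shapes, and the assumption that the model dimension equals the sequence length (so a linear layer can act on the time axis) are illustrative simplifications, not the authors' implementation.

```python
# Hedged sketch of a CATS-style adapter (illustrative only; names are hypothetical).
import torch
import torch.nn as nn

class CorrelationAdapter(nn.Module):
    def __init__(self, n_vars: int, d_model: int, kernel_size: int = 3):
        super().__init__()
        # Local temporal patterns: one depthwise conv per variable channel.
        self.temporal_conv = nn.Conv1d(
            n_vars, n_vars, kernel_size, padding=kernel_size // 2, groups=n_vars
        )
        # Attention over variables: the score matrix plays the role of a learned
        # multivariate correlation that can be reweighted toward the source domain.
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, n_vars, seq_len), assuming seq_len == d_model
        h = self.temporal_conv(x)                            # (B, V, L)
        attn = torch.softmax(
            self.q(h) @ self.k(h).transpose(1, 2) / h.size(-1) ** 0.5, dim=-1
        )                                                    # (B, V, V) variable attention
        return x + attn @ h                                  # residual, plug-and-play

x = torch.randn(8, 7, 96)
out = CorrelationAdapter(n_vars=7, d_model=96)(x)
```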
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Ai, Mengting, Wei, Tianxin, Chen, Yifan, Zeng, Zhichen, Zhao, Ritchie, Varatkar, Girish, Rouhani, Bita Darvish, Tang, Xianfeng, Tong, Hanghang, He, Jingrui
Mixture-of-Experts (MoE) Transformer, the backbone architecture of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. The sparse structure, while allowing constant time costs, results in space inefficiency: we still need to load all the model parameters during inference. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones. ResMoE enhances the space efficiency for inference of large-scale MoE Transformers in a one-shot and data-agnostic manner without retraining, while maintaining minimal accuracy loss, thereby paving the way for broader accessibility to large language models.
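As a rough illustration of the compression recipe the abstract describes, the sketch below extracts a shared expert and stores compressed per-expert residuals. The plain average stands in for the Wasserstein barycenter and a truncated SVD for the residual approximation; both are simplifying assumptions, not the paper's method.

```python
# Hedged sketch of the ResMoE idea (not the authors' code): store one shared
# "barycenter" expert plus a low-rank residual per expert, restored at load time.
import torch

def compress_experts(expert_weights: list[torch.Tensor], rank: int):
    barycenter = torch.stack(expert_weights).mean(dim=0)   # stand-in for Wasserstein barycenter
    residuals = []
    for w in expert_weights:
        u, s, vh = torch.linalg.svd(w - barycenter, full_matrices=False)
        residuals.append((u[:, :rank] * s[:rank], vh[:rank]))  # low-rank factors
    return barycenter, residuals

def restore_expert(barycenter, residual):
    us, vh = residual
    return barycenter + us @ vh   # one-shot residual restoration, no retraining

experts = [torch.randn(512, 2048) for _ in range(8)]  # toy FFN expert weights
bc, res = compress_experts(experts, rank=32)
approx = restore_expert(bc, res[0])                   # approximation of experts[0]
```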
Exploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs
Wu, Qizhe, Liang, Huawen, Gui, Yuchen, Zeng, Zhichen, He, Zerong, Tao, Linfeng, Wang, Xiaotian, Zhao, Letian, Zeng, Zhaoxi, Yuan, Wei, Wu, Wei, Jin, Xi
General matrix-matrix multiplication (GEMM) is a cornerstone of AI computations, making tensor processing engines (TPEs) increasingly critical in GPUs and domain-specific architectures. Existing architectures primarily optimize dataflow or operand reuse strategies. However, considering the interaction between matrix multiplication and multiply-accumulators (MACs) offers greater optimization potential. This work introduces a novel hardware perspective on matrix multiplication, focusing on the bit-weight dimension of MACs. We propose a finer-grained TPE notation using matrix triple loops as an example, introducing new methods for designing and optimizing PE microarchitectures. Based on this notation and its transformations, we propose four optimization techniques that improve timing, area, and power consumption. Implementing our design in RTL using the SMIC-28nm process, we evaluate its effectiveness across four classic TPE architectures: systolic array, 3D-Cube, multiplier-adder tree, and 2D-Matrix. Our techniques achieve area efficiency improvements of 1.27x, 1.28x, 1.56x, and 1.44x, and energy efficiency gains of 1.04x, 1.56x, 1.49x, and 1.20x, respectively. Applied to a bit-slice architecture, our approach achieves a 12.10x improvement in energy efficiency and 2.85x in area efficiency compared to Laconic. Our Verilog HDL code, along with timing, area, and power reports, is available at https://github.com/wqzustc/High-Performance-Tensor-Processing-Engines
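The following software model is a hedged illustration of what "the bit-weight dimension of MACs" refers to: each multiply inside the GEMM triple loop is expanded into bit-weighted partial products, so accumulations at the same weight can be merged before a single shift. It models the arithmetic only (for unsigned operands); the paper's actual contributions are RTL microarchitecture transformations.

```python
def gemm_bitweight(A, B, bits=8):
    # C = A @ B with the inner multiply expanded along the bit weights of B's
    # entries: partial products of equal weight accumulate together, and the
    # weighting shift is applied once per output instead of once per multiply.
    m, k, n = len(A), len(A[0]), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            weighted = [0] * bits            # one accumulator per bit weight
            for p in range(k):
                for b in range(bits):
                    if (B[p][j] >> b) & 1:
                        weighted[b] += A[i][p]
            C[i][j] = sum(acc << b for b, acc in enumerate(weighted))
    return C

A = [[3, 1], [2, 5]]
B = [[4, 7], [6, 2]]
assert gemm_bitweight(A, B) == [[18, 23], [38, 24]]
```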
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
Liang, Mingfu, Liu, Xi, Jin, Rong, Liu, Boyang, Suo, Qiuling, Zhou, Qinghai, Zhou, Song, Chen, Laming, Zheng, Hua, Li, Zhiyuan, Jiang, Shali, Yang, Jiyan, Xia, Xiaozhen, Yang, Fan, Badr, Yasmine, Wen, Ellie, Xu, Shuyu, Chen, Hansey, Zhang, Zhengyu, Nie, Jade, Yang, Chunzhi, Zeng, Zhichen, Zhang, Weilin, Huang, Xingliang, Li, Qianru, Wang, Shiquan, Lyu, Evelyn, Lu, Wenjing, Zhang, Rui, Wang, Wenjun, Rudy, Jason, Hang, Mengyue, Wang, Kai, Ma, Yinbin, Wang, Shuaiwen, Zeng, Sihan, Tang, Tongyi, Wei, Xiaohan, Jin, Longhao, Zhang, Jamey, Chen, Marcus, Zhang, Jiayi, Huang, Angie, Zhang, Chi, Zhao, Zhengli, Yang, Jared, Jin, Qiang, Chen, Xian, Amlesahwaram, Amit Anand, Song, Lexi, Luo, Liang, Hao, Yuchen, Xiao, Nan, Yetim, Yavuz, Pan, Luoshang, Liu, Gaoxiang, Hu, Yuxi, Huang, Yuzhen, Xu, Jackie, Zhu, Rich, Zhang, Xin, Liu, Yiqun, Yin, Hang, Chen, Yuxin, Zhang, Buyun, Liu, Xiaoyi, Wang, Xingyuan, Mao, Wenguang, Li, Zhijing, Huang, Qin, Sun, Chonglin, Yu, Nancy, Gu, Shuo, Mao, Shupin, Au, Benjamin, Qin, Jingzheng, Yao, Peggy, Choi, Jae-Woo, Gao, Bin, Wang, Ernest, Zhang, Lei, Chen, Wen-Yen, Lee, Ted, Zha, Jay, Meng, Yi, Gong, Alex, Gao, Edison, Vahdatpour, Alireza, Han, Yiping, Yao, Yantao, Kureha, Toshinari, Chang, Shuo, Sultan, Musharaf, Bocharov, John, Chordia, Sagar, Gan, Xiaorui, Sun, Peng, Liu, Rocky, Long, Bo, Chen, Wenlin, Kolay, Santanu, Li, Huayu
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up and advanced design of the recommendation model can bring significant performance improvement. However, as model scale grows, such prior studies drift increasingly far from industrial practice, as they often neglect two fundamental challenges in industrial-scale applications. First, training and inference budgets are restricted for the model to be served; exceeding them may incur latency and impair the user experience. Second, large-volume data arrive in a streaming mode, with data distributions shifting dynamically as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address these overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher as a foundation model (FM) that can serve multiple students as vertical models (VMs), amortizing its building cost. We propose an Auxiliary Head and a Student Adapter to mitigate the data distribution gap between the FM and VMs caused by the streaming data issue. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gains from ExFM.
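A minimal sketch of the distillation setup the abstract outlines, with hypothetical module and loss names: the student's main head trains on labels while an auxiliary head, fed through a small adapter, regresses the frozen teacher's predictions. The shapes and loss weighting are illustrative assumptions.

```python
# Hedged sketch (hypothetical names) of an ExFM-style teacher-student setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentWithAdapter(nn.Module):
    def __init__(self, d_in: int, d_hidden: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.main_head = nn.Linear(d_hidden, 1)       # serves live traffic
        self.adapter = nn.Linear(d_hidden, d_hidden)  # student adapter (assumed form)
        self.aux_head = nn.Linear(d_hidden, 1)        # auxiliary head for teacher signal

    def forward(self, x):
        h = self.backbone(x)
        return self.main_head(h), self.aux_head(self.adapter(h))

def exfm_style_loss(main_logit, aux_logit, label, teacher_prob, alpha=0.5):
    hard = F.binary_cross_entropy_with_logits(main_logit, label)           # labels
    distill = F.binary_cross_entropy_with_logits(aux_logit, teacher_prob)  # frozen FM
    return hard + alpha * distill

x, label = torch.randn(16, 32), torch.randint(0, 2, (16, 1)).float()
teacher_prob = torch.rand(16, 1)      # predictions from the external teacher (FM)
main, aux = StudentWithAdapter(32)(x)
loss = exfm_style_loss(main, aux, label, teacher_prob)
```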
Joint Optimal Transport and Embedding for Network Alignment
Yu, Qi, Zeng, Zhichen, Yan, Yuchen, Ying, Lei, Srikant, R., Tong, Hanghang
Network alignment, which aims to find node correspondence across different networks, is the cornerstone of various downstream multi-network and Web mining tasks. Most of the embedding-based methods indirectly model cross-network node relationships by contrasting positive and negative node pairs sampled from hand-crafted strategies, which are vulnerable to graph noise and lead to potential misalignment of nodes. Another line of work based on optimal transport (OT) theory directly models cross-network node relationships and generates noise-reduced alignments. However, OT methods heavily rely on fixed, pre-defined cost functions that prohibit end-to-end training and are hard to generalize. In this paper, we aim to unify the embedding- and OT-based methods in a mutually beneficial manner and propose a joint optimal transport and embedding framework for network alignment named JOENA. For one thing (OT for embedding), through a simple yet effective transformation, the noise-reduced OT mapping serves as an adaptive sampling strategy directly modeling all cross-network node pairs for robust embedding learning. For another (embedding for OT), on top of the learned embeddings, the OT cost can be gradually trained in an end-to-end fashion, which further enhances the alignment quality. With a unified objective, the mutual benefits of both methods can be achieved by an alternating optimization schema with guaranteed convergence. Extensive experiments on real-world networks validate the effectiveness and scalability of JOENA, achieving up to 16% improvement in MRR and 20x speedup compared with state-of-the-art alignment methods.
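To illustrate the alternating schema, here is a hedged toy loop (not the authors' code): a Sinkhorn iteration computes an OT plan from the current embeddings, and the fixed plan then supervises the embedding (and hence cost) update.

```python
# Toy alternation between an OT step and an embedding step, in the spirit of
# the framework described above; all hyper-parameters are illustrative.
import torch

def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50):
    # Entropic OT with uniform (unnormalized) marginals; a standard sketch.
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0))
    v = torch.ones(cost.size(1))
    for _ in range(iters):
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # transport plan

emb1 = torch.randn(50, 16, requires_grad=True)
emb2 = torch.randn(50, 16, requires_grad=True)
opt = torch.optim.Adam([emb1, emb2], lr=1e-2)
for step in range(100):
    cost = torch.cdist(emb1, emb2)
    cost = cost / cost.max()            # keep exp(-cost/eps) well-conditioned
    with torch.no_grad():
        plan = sinkhorn(cost)           # OT step: noise-reduced soft alignment
    loss = (plan * cost).sum()          # embedding step: fit embeddings to the plan
    opt.zero_grad(); loss.backward(); opt.step()
```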
Improving LLM General Preference Alignment via Optimistic Online Mirror Descent
Zhang, Yuheng, Yu, Dian, Ge, Tao, Song, Linfeng, Zeng, Zhichen, Mi, Haitao, Jiang, Nan, Yu, Dong
Reinforcement learning from human feedback (RLHF) has demonstrated remarkable effectiveness in aligning large language models (LLMs) with human preferences. Many existing alignment approaches rely on the Bradley-Terry (BT) model assumption, which assumes the existence of a ground-truth reward for each prompt-response pair. However, this assumption can be overly restrictive when modeling complex human preferences. In this paper, we drop the BT model assumption and study LLM alignment under general preferences, formulated as a two-player game. Drawing on theoretical insights from learning in games, we integrate optimistic online mirror descent into our alignment framework to approximate the Nash policy. Theoretically, we demonstrate that our approach achieves an $O(T^{-1})$ bound on the duality gap, improving upon the previous $O(T^{-1/2})$ result. More importantly, we implement our method and show through experiments that it outperforms state-of-the-art RLHF algorithms across multiple representative benchmarks.
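As a hedged toy model of the optimization machinery (a small matrix game standing in for the preference game over LLM policies), the snippet below runs optimistic multiplicative weights, an entropy-regularized instance of optimistic online mirror descent. The "optimistic" term reuses the previous gradient as a prediction of the next one, which is what improves the duality-gap rate to $O(T^{-1})$.

```python
# Optimistic multiplicative weights on a two-player zero-sum matrix game.
import numpy as np

rng = np.random.default_rng(0)
G = rng.uniform(-1, 1, (5, 5))        # payoff matrix: row player maximizes x^T G y
x = np.ones(5) / 5
y = np.ones(5) / 5
gx_prev, gy_prev = G @ y, G.T @ x
eta = 0.1
for t in range(1000):
    gx, gy = G @ y, G.T @ x
    # Optimistic step: current gradient plus (current - previous) as a prediction,
    # i.e. an exponentiated update with 2*g_t - g_{t-1}.
    x = x * np.exp(eta * (2 * gx - gx_prev)); x /= x.sum()   # max player ascends
    y = y * np.exp(-eta * (2 * gy - gy_prev)); y /= y.sum()  # min player descends
    gx_prev, gy_prev = gx, gy

gap = (G @ y).max() - (G.T @ x).min()   # duality gap at the last iterate
print(f"duality gap ~ {gap:.4f}")
```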
Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Zhu, Kan, Tang, Tian, Xu, Qinyu, Gu, Yile, Zeng, Zhichen, Kadekodi, Rohan, Zhao, Liangyu, Li, Ang, Krishnamurthy, Arvind, Kasikci, Baris
Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
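The core selection rule is easy to state in code. The sketch below keeps the smallest set of tokens whose attention mass reaches a target fraction; it computes exact scores for clarity, whereas the paper's clustering and distribution fitting exist precisely to avoid that full computation.

```python
# Hedged sketch of cumulative-attention token selection (illustrative only).
import torch

def select_tokens(attn_scores: torch.Tensor, target_frac: float = 0.95):
    # attn_scores: (seq_len,) softmax attention weights for one query and head.
    sorted_scores, idx = attn_scores.sort(descending=True)
    cum = sorted_scores.cumsum(dim=0)
    # Smallest k whose cumulative attention mass reaches the target fraction.
    k = int(torch.searchsorted(cum, torch.tensor(target_frac)).item()) + 1
    k = min(k, attn_scores.numel())
    return idx[:k]                     # adaptive budget: varies with sparsity

scores = torch.softmax(torch.randn(1024), dim=0)
kept = select_tokens(scores, 0.95)
print(f"kept {kept.numel()} of 1024 tokens")
```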
THeGCN: Temporal Heterophilic Graph Convolutional Network
Yan, Yuchen, Chen, Yuzhong, Chen, Huiyuan, Li, Xiaoting, Xu, Zhe, Zeng, Zhichen, Liu, Lihui, Liu, Zhining, Tong, Hanghang
Graph Neural Networks (GNNs) have exhibited remarkable efficacy in diverse graph learning tasks, particularly on static homophilic graphs. Recent attention has pivoted towards more intricate structures, encompassing (1) static heterophilic graphs, which face the edge heterophily issue in the spatial domain, and (2) event-based continuous graphs in the temporal domain. State-of-the-art (SOTA) methods address these two lines of work concurrently but tend to overlook the presence of heterophily in the temporal domain, which constitutes the temporal heterophily issue. Furthermore, we highlight that the edge heterophily issue and the temporal heterophily issue often co-exist in event-based continuous graphs, giving rise to the temporal edge heterophily challenge. To tackle this challenge, this paper first introduces a temporal edge heterophily measurement. Subsequently, we propose the Temporal Heterophilic Graph Convolutional Network (THeGCN), an innovative model that incorporates low/high-pass graph signal filtering to accurately capture both edge (spatial) heterophily and temporal heterophily. Specifically, the THeGCN model consists of two key components: a sampler and an aggregator. The sampler selects events relevant to a node at a given moment. Then, the aggregator executes message passing, encoding temporal information, node attributes, and edge attributes into node embeddings. Extensive experiments conducted on 5 real-world datasets validate the efficacy of THeGCN.
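For intuition, the sketch below shows generic low/high-pass graph signal filters of the kind THeGCN builds on (illustrative, not the authors' exact aggregator): the low-pass output averages over neighbors (homophily), while the high-pass output keeps deviations from neighbors (heterophily).

```python
# Generic low/high-pass graph signal filters on a row-normalized adjacency.
import torch

def low_high_pass(adj: torch.Tensor, x: torch.Tensor):
    deg = adj.sum(dim=1).clamp(min=1)
    a_norm = adj / deg.unsqueeze(1)       # row-normalized adjacency
    low = a_norm @ x                      # low-pass: neighbor average
    high = x - a_norm @ x                 # high-pass: deviation from neighbors
    return low, high

adj = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])
x = torch.randn(3, 8)                     # node features
low, high = low_high_pass(adj, x)
# A THeGCN-style layer would mix `low` and `high` with learned weights, with the
# sampler restricting `adj` to temporally relevant events.
```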
InterFormer: Towards Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction
Zeng, Zhichen, Liu, Xiaolong, Hang, Mengyue, Liu, Xiaoyi, Zhou, Qinghai, Yang, Chaofei, Liu, Yiqun, Ruan, Yichen, Chen, Laming, Chen, Yuxin, Hao, Yujia, Xu, Jiaqi, Nie, Jade, Liu, Xi, Zhang, Buyun, Wen, Wei, Yuan, Siyang, Wang, Kai, Chen, Wen-Yen, Han, Yiping, Li, Huayu, Yang, Chunzhi, Long, Bo, Yu, Philip S., Tong, Hanghang, Yang, Jiyan
Click-through rate (CTR) prediction, which predicts the probability of a user clicking an ad, is a fundamental task in recommender systems. The emergence of heterogeneous information, such as user profile and behavior sequences, depicts user interests from different aspects. A mutually beneficial integration of heterogeneous information is the cornerstone towards the success of CTR prediction. However, most of the existing methods suffer from two fundamental limitations, including (1) insufficient inter-mode interaction due to the unidirectional information flow between modes, and (2) aggressive information aggregation caused by early summarization, resulting in excessive information loss. To address the above limitations, we propose a novel module named InterFormer to learn heterogeneous information interaction in an interleaving style. To achieve better interaction learning, InterFormer enables bidirectional information flow for mutually beneficial learning across different modes. To avoid aggressive information aggregation, we retain complete information in each data mode and use a separate bridging arch for effective information selection and summarization. Our proposed InterFormer achieves state-of-the-art performance on three public datasets and a large-scale industrial dataset.
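A hedged sketch of the bidirectional flow described above, with hypothetical module names: two modes cross-attend to each other within one layer, and each retains its full token-level representation instead of an early summary.

```python
# Illustrative bidirectional cross-attention between two data modes.
import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, mode_a, mode_b):
        # Bidirectional flow: each mode queries the other in the same layer.
        a_new, _ = self.attn_a(mode_a, mode_b, mode_b)   # A attends to B
        b_new, _ = self.attn_b(mode_b, mode_a, mode_a)   # B attends to A
        # Residuals keep each mode's complete, unsummarized representation.
        return mode_a + a_new, mode_b + b_new

profile = torch.randn(32, 10, 64)    # (batch, profile tokens, dim)
behavior = torch.randn(32, 50, 64)   # (batch, behavior sequence, dim)
layer = BidirectionalInteraction(64)
profile, behavior = layer(profile, behavior)
```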
PyG-SSL: A Graph Self-Supervised Learning Toolkit
Zheng, Lecheng, Jing, Baoyu, Li, Zihao, Zeng, Zhichen, Wei, Tianxin, Ai, Mengting, He, Xinrui, Liu, Lihui, Fu, Dongqi, You, Jiaxuan, Tong, Hanghang, He, Jingrui
Graph Self-Supervised Learning (SSL) has emerged as a pivotal area of research in recent years. By engaging in pretext tasks to learn the intricate topological structures and properties of graphs using unlabeled data, graph SSL models achieve enhanced performance, improved generalization, and heightened robustness. Despite the remarkable achievements of these graph SSL methods, their current implementations pose significant challenges for beginners and practitioners: the complex nature of graph structures, inconsistent evaluation metrics, and concerns regarding reproducibility hinder further progress in this field. Recognizing the growing interest within the research community, there is an urgent need for a comprehensive, beginner-friendly, and accessible toolkit consisting of the most representative graph SSL algorithms. To address these challenges, we present a graph SSL toolkit named PyG-SSL, which is built upon PyTorch and is compatible with various deep learning and scientific computing backends. Within the toolkit, we offer a unified framework encompassing dataset loading, hyper-parameter configuration, model training, and comprehensive performance evaluation for diverse downstream tasks. Moreover, we provide beginner-friendly tutorials and the best hyper-parameters of each graph SSL algorithm on different graph datasets, facilitating the reproduction of results.
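PyG-SSL's own API may differ from what is shown here; as a minimal illustration of the kind of pretext task such toolkits package, the snippet below implements a GRACE-style cross-view contrastive objective in plain PyTorch.

```python
# A representative graph-SSL pretext objective (generic, not PyG-SSL's API).
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    # z1, z2: (n_nodes, dim) embeddings of two augmented views of one graph.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.T / tau                       # cross-view similarities
    labels = torch.arange(z1.size(0))           # positives on the diagonal
    return F.cross_entropy(sim, labels)         # pull same node, push others

z1, z2 = torch.randn(100, 32), torch.randn(100, 32)
loss = contrastive_loss(z1, z2)
```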