Jiang, Zihan
TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
Sun, Lin, Zhao, Guangxiang, Jian, Xiaoqi, Wu, Yuhan, Lin, Weihong, Zhu, Yongfu, Jia, Change, Zhang, Linglin, Wu, Jinzhu, Ran, Junfeng, Hu, Sai-er, Jiang, Zihan, Zhou, Junting, Liu, Wenrui, Cui, Bin, Yang, Tong, Zhang, Xiangzheng
The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textit{selectively distilled} into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points), and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
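As a hedged illustration of the Merge Phase only, the sketch below combines same-architecture, domain-specialized student checkpoints by weighted parameter averaging; the function name and the uniform weighting are illustrative assumptions, not the paper's exact merging recipe.

    import torch

    def merge_checkpoints(state_dicts, coeffs=None):
        """Merge same-architecture checkpoints by weighted parameter averaging."""
        coeffs = coeffs or [1.0 / len(state_dicts)] * len(state_dicts)
        merged = {}
        for name in state_dicts[0]:
            # Weighted sum of the corresponding tensor from each domain expert.
            merged[name] = sum(c * sd[name].float() for c, sd in zip(coeffs, state_dicts))
        return merged

    # e.g., merged = merge_checkpoints([math_sd, code_sd, science_sd])  # hypothetical names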
INT-FlashAttention: Enabling Flash Attention for INT8 Quantization
Chen, Shimao, Liu, Zirui, Wu, Zhiying, Zheng, Ce, Cong, Peizhuang, Jiang, Zihan, Wu, Yuhan, Su, Lei, Yang, Tong
As the foundation of large language models (LLMs), the self-attention module faces the challenge of quadratic time and memory complexity with respect to sequence length. FlashAttention accelerates attention computation and reduces its memory usage by leveraging the GPU memory hierarchy. A promising research direction is to integrate FlashAttention with quantization methods. This paper introduces INT-FlashAttention, the first INT8 quantization architecture compatible with the forward workflow of FlashAttention, which significantly improves the inference speed of FlashAttention on Ampere GPUs. We implement our INT-FlashAttention prototype with fully INT8 activations and general matrix-multiplication (GEMM) kernels, making it the first attention operator with fully INT8 input. As a general token-level post-training quantization framework, INT-FlashAttention is also compatible with other data formats such as INT4. Experimental results show that INT-FlashAttention achieves 72% faster inference speed and 82% smaller quantization error compared with standard FlashAttention using FP16 and FP8 data formats, respectively.
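The paper's GPU kernels are not reproduced here, but a minimal sketch of the token-level post-training quantization they build on may help: each token (row) of an activation matrix receives its own INT8 scale. The function name is an illustrative assumption.

    import torch

    def quantize_per_token(x: torch.Tensor):
        """INT8-quantize each token (row) of x with a per-token scale."""
        scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
        q = torch.round(x / scale).clamp(-127, 127).to(torch.int8)
        return q, scale  # recover values approximately via q.float() * scale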
MCNS: Mining Causal Natural Structures Inside Time Series via A Novel Internal Causality Scheme
Liu, Yuanhao, Du, Dehui, Jiang, Zihan, Huang, Anyan, Li, Yiyang
Causal inference allows us to discover hidden relationships among variables in time series. In most existing work, however, these variables are the dimensions of the series, and causality between dimensions can be superficial, which limits both our understanding of the internal relationships and the benefit a causal graph can bring to neural networks (NNs). In this paper, we observe that causality exists not only across but also inside time series, because a series reflects a succession of events in the real world. This inspires us to seek relationships between internal subsequences. The challenges are discovering causality from subsequences and exploiting the resulting causal natural structures to improve NNs. To address these challenges, we propose a novel framework called Mining Causal Natural Structure (MCNS), which is automatic and domain-agnostic and finds the causal natural structures inside time series via an internal causality scheme. We evaluate the MCNS framework, and NNs infused with the structures it mines, on time series classification tasks. Experimental results show that our approach, by refining attention, selecting shapes for classification, and pruning datasets, brings NNs, and even the data itself, better accuracy and interpretability. Besides, MCNS provides an in-depth, solid summary of the time series and datasets.
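As a hedged illustration of subsequence-level (internal) causality, the sketch below scores how often occurrences of one pattern precede occurrences of another within a fixed lag; it conveys the intuition only and is not the MCNS algorithm itself. All names are illustrative.

    import numpy as np

    def occurrences(series, pattern, tol=0.5):
        """Indices where a sliding window of len(pattern) closely matches pattern."""
        w = len(pattern)
        return [i for i in range(len(series) - w + 1)
                if np.linalg.norm(series[i:i + w] - pattern) < tol]

    def precedence_score(series, pat_a, pat_b, lag=10):
        """Fraction of pat_b occurrences preceded by a pat_a occurrence within lag steps."""
        occ_a, occ_b = occurrences(series, pat_a), occurrences(series, pat_b)
        if not occ_b:
            return 0.0
        hits = sum(any(0 < j - i <= lag for i in occ_a) for j in occ_b)
        return hits / len(occ_b)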
CMLCompiler: A Unified Compiler for Classical Machine Learning
Wen, Xu, Gao, Wanling, Li, Anzheng, Wang, Lei, Jiang, Zihan, Zhan, Jianfeng
Classical machine learning (CML) occupies nearly half of the machine learning pipelines in production applications. Unfortunately, it fails to fully utilize state-of-the-practice devices and performs poorly. Without a unified framework, hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs conversion and graph optimization based on these two abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance: it achieves up to 4.38$\times$ speedup on CPU, 3.31$\times$ speedup on GPU, and 5.09$\times$ speedup on IoT devices, compared to the state-of-the-art solutions -- scikit-learn, Intel sklearn, and hummingbird. Our mixed CML and DL pipelines achieve up to 3.04$\times$ speedup compared with cross-framework implementations. The project documents and source code are available at https://www.computercouncil.org/cmlcompiler.
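To make the unified abstractions concrete, here is a hedged, much-simplified sketch of lowering one CML operator -- a fitted multiclass linear classifier -- to tensor operations that a DL compiler or framework can then optimize; it illustrates the idea and is not CMLCompiler's actual operator representation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def lower_linear_classifier(model: LogisticRegression):
        """Lower a fitted multiclass linear model to a GEMM + argmax graph."""
        W, b = model.coef_.T, model.intercept_      # learned parameters as tensors
        def compiled_predict(X):
            return np.argmax(X @ W + b, axis=1)     # GEMM -> bias add -> argmax
        return compiled_predict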
OpenClinicalAI: enabling AI to diagnose diseases in real-world clinical settings
Huang, Yunyou, Wang, Nana, Tang, Suqin, Ma, Li, Hao, Tianshu, Jiang, Zihan, Zhang, Fan, Kang, Guoxin, Miao, Xiuxia, Guan, Xianglong, Zhang, Ruchang, Zhang, Zhifei, Zhan, Jianfeng
This paper quantitatively reveals that state-of-the-art and state-of-the-practice AI systems achieve acceptable performance only under the stringent condition that all categories of subjects are known in advance, which we call the closed clinical setting, and that they fail to work in real-world clinical settings. Compared to the diagnosis task in the closed setting, real-world clinical settings pose severe challenges and must be treated differently. We build a clinical AI benchmark named Clinical AIBench that sets up real-world clinical settings to facilitate research. We propose an open, dynamic machine learning framework and develop an AI system named OpenClinicalAI to diagnose diseases in real-world clinical settings. The first versions of Clinical AIBench and OpenClinicalAI target Alzheimer's disease. In the real-world clinical setting, OpenClinicalAI significantly outperforms the state-of-the-art AI system. In addition, OpenClinicalAI develops personalized diagnosis strategies to avoid unnecessary testing and seamlessly collaborates with clinicians. It is therefore a promising candidate for embedding in current medical systems to improve medical services.
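One way to picture the open setting is that the system must be able to return "unknown" for subjects outside its known categories instead of forcing a known label. The threshold rule below is a hedged, illustrative stand-in, not OpenClinicalAI's actual diagnosis strategy.

    import numpy as np

    def open_set_predict(probs: np.ndarray, threshold: float = 0.8):
        """Predict a known class only when confident; otherwise flag 'unknown'."""
        labels, conf = probs.argmax(axis=1), probs.max(axis=1)
        return [int(l) if c >= threshold else "unknown"
                for l, c in zip(labels, conf)]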
AIBench Training: Balanced Industry-Standard AI Training Benchmarking
Tang, Fei, Gao, Wanling, Zhan, Jianfeng, Lan, Chuanxin, Wen, Xu, Wang, Lei, Luo, Chunjie, Dai, Jiahui, Cao, Zheng, Xiong, Xingwang, Jiang, Zihan, Hao, Tianshu, Fan, Fanda, Zhang, Fan, Huang, Yunyou, Chen, Jianan, Du, Mengjia, Ren, Rui, Zheng, Chen, Zheng, Daoyi, Tang, Haoning, Zhan, Kunlin, Wang, Biao, Kong, Defei, Yu, Minghe, Tan, Chongkang, Li, Huan, Tian, Xinhui, Li, Yatao, Lu, Gang, Shao, Junchao, Wang, Zhenyu, Wang, Xiaoyu, Ye, Hainan
Early-stage evaluations of a new AI architecture or system need affordable AI benchmarks, while using a few AI component benchmarks alone in the later stages may lead to misleading conclusions. This paper proposes a balanced benchmarking methodology. Based on an exhaustive survey of Internet-service AI domains, we identify and implement seventeen representative AI tasks with state-of-the-art models to guarantee the diversity and representativeness of the benchmarks. Meanwhile, we keep the benchmark subset to a minimum for affordability. Together with seventeen industry partners, we contribute by far the most comprehensive AI training benchmark suite. The evaluations show: (1) AIBench Training outperforms MLPerf Training in terms of the diversity and representativeness of model complexity, computational cost, convergence rate, computation and memory access patterns, and hotspot functions; (2) with respect to the full AIBench benchmarks, the subset shortens the benchmarking cost by 54% while maintaining the primary workload characteristics; (3) the performance ranking shows that a single-purpose AI accelerator such as the TPU, with an optimized TensorFlow framework, performs better than GPUs, while losing the latter's general support for a variety of AI models. The AIBench Training specifications, source code, testbed, and performance numbers are publicly available from the website http://www.benchcouncil.org/AIBench/index.html.
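As a hedged sketch of how a minimal yet representative subset might be chosen, the code below clusters the full suite on workload-characteristic features and keeps the task nearest each centroid; the feature set and number of clusters are assumptions for illustration, not AIBench's published procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def select_subset(features: np.ndarray, k: int = 5):
        """features: one row of workload metrics per benchmark task."""
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        subset = []
        for c in range(k):
            members = np.where(km.labels_ == c)[0]
            dist = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
            subset.append(int(members[dist.argmin()]))   # keep the most central task
        return sorted(subset)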
HPC AI500: A Benchmark Suite for HPC AI Systems
Jiang, Zihan, Gao, Wanling, Wang, Lei, Xiong, Xingwang, Zhang, Yuchen, Wen, Xu, Luo, Chunjie, Ye, Hainan, Zhang, Yunquan, Feng, Shengzhong, Li, Kenli, Xu, Weijia, Zhan, Jianfeng
In recent years, with the trend of applying deep learning (DL) to high-performance scientific computing, the unique characteristics of emerging DL workloads in HPC pose great challenges in designing and implementing HPC AI systems. The community needs a new yardstick for evaluating future HPC systems. In this paper, we propose HPC AI500 --- a benchmark suite for evaluating HPC systems that run scientific DL workloads. Covering the most representative scientific fields, each workload in HPC AI500 is based on real-world scientific DL applications. Currently, we choose 14 scientific DL benchmarks, selected from the perspectives of application scenarios, datasets, and software stacks. We propose a set of metrics for comprehensively evaluating HPC AI systems, considering accuracy and performance as well as power and cost. We provide a scalable reference implementation of HPC AI500. HPC AI500 is part of the open-source AIBench project; the specification and source code are publicly available from \url{http://www.benchcouncil.org/AIBench/index.html}.
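The combined metrics are not spelled out above; as a hedged sketch in their spirit, one accuracy-penalized throughput measure scales raw FLOPS by how close a run comes to a target accuracy. The functional form and exponent are illustrative assumptions, not HPC AI500's definition.

    def valid_flops(flops: float, achieved_acc: float, target_acc: float, n: int = 5):
        """Penalize raw FLOPS by the shortfall from the target accuracy."""
        return flops * (achieved_acc / target_acc) ** n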