Zhu, Xuekai
Technologies on Effectiveness and Efficiency: A Survey of State Space Models
Lv, Xingtai, Sun, Youbang, Zhang, Kaiyan, Qu, Shang, Zhu, Xuekai, Fan, Yuchen, Wu, Yi, Hua, Ermo, Long, Xinwei, Ding, Ning, Zhou, Bowen
State Space Models (SSMs) have emerged as a promising alternative to the popular transformer-based models and have been gaining increasing attention. Compared to transformers, SSMs excel at tasks with sequential data or longer contexts, demonstrating comparable performance with significant efficiency gains. In this survey, we provide a coherent and systematic overview of SSMs, including their theoretical motivations, mathematical formulations, comparison with existing model classes, and various applications. We divide the SSM series into three main sections, providing a detailed introduction to the original SSM, the structured SSM represented by S4, and the selective SSM typified by Mamba. We place an emphasis on technical detail and highlight the key techniques introduced to address the effectiveness and efficiency of SSMs. We hope this manuscript serves as an introduction for researchers to explore the theoretical foundations of SSMs.
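The "mathematical formulations" the survey covers center on the linear state-space recurrence. Below is a minimal sketch of the continuous-to-discrete view shared by S4-style models, assuming a bilinear discretization and toy random parameters rather than any particular model's trained values.

```python
import numpy as np

def discretize(A, B, dt):
    """Bilinear discretization of the continuous SSM x'(t) = A x(t) + B u(t)."""
    n = A.shape[0]
    inv = np.linalg.inv(np.eye(n) - dt / 2 * A)
    Ad = inv @ (np.eye(n) + dt / 2 * A)
    Bd = inv @ (dt * B)
    return Ad, Bd

def ssm_scan(Ad, Bd, C, u):
    """Run the discrete recurrence x_k = Ad x_{k-1} + Bd u_k, y_k = C x_k."""
    x = np.zeros(Ad.shape[0])
    ys = []
    for u_k in u:
        x = Ad @ x + Bd.flatten() * u_k
        ys.append(C @ x)
    return np.array(ys)

# Toy example: a 4-dimensional state with scalar input/output and random parameters.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
C = rng.standard_normal(4)
Ad, Bd = discretize(A, B, dt=0.1)
y = ssm_scan(Ad, Bd, C, rng.standard_normal(32))
print(y.shape)  # (32,)
```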
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Zuo, Yuxin, Qu, Shang, Li, Yifei, Chen, Zhangren, Zhu, Xuekai, Hua, Ermo, Zhang, Kaiyan, Ding, Ning, Zhou, Bowen
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks built from simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.
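For orientation, here is a hedged sketch of how a multiple-choice benchmark of this kind is typically scored; the record fields (`question`, `options`, `answer`) and the `generate_answer` callable are hypothetical stand-ins, not MedXpertQA's released schema or evaluation harness.

```python
from typing import Callable, Dict, List

def evaluate_accuracy(records: List[Dict], generate_answer: Callable[[str], str]) -> float:
    """Score a model on multiple-choice items; field names are assumed for illustration."""
    correct = 0
    for rec in records:
        # Present the stem together with lettered options, then compare the model's letter choice.
        prompt = rec["question"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in rec["options"].items()
        )
        prediction = generate_answer(prompt).strip().upper()[:1]
        correct += prediction == rec["answer"]
    return correct / len(records)

# Tiny usage example with a dummy model that always answers "A".
sample = [{"question": "Which organ produces insulin?",
           "options": {"A": "Pancreas", "B": "Liver", "C": "Spleen", "D": "Kidney"},
           "answer": "A"}]
print(evaluate_accuracy(sample, lambda prompt: "A"))  # 1.0
```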
Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization
Hua, Ermo, Jiang, Che, Lv, Xingtai, Zhang, Kaiyan, Ding, Ning, Sun, Youbang, Qi, Biqing, Fan, Yuchen, Zhu, Xuekai, Zhou, Bowen
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While existing works mainly address RoPE's limitations within the attention mechanism, this paper provides an analysis across nearly all parts of LMs, uncovering adverse effects on the length generalization of RoPE-based attention. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against spectral damage. Experiments across various model scales show that, within varying context windows, FoPE maintains a more stable perplexity and a more consistent accuracy on a needle-in-a-haystack task compared to RoPE and ALiBi. Several analyses and ablations lend further support to our method and theoretical modeling.
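To make the frequency-domain intuition concrete, the sketch below shows standard RoPE frequencies and an illustrative "zero out under-trained components" rule. The rule (discarding any component that completes less than one full period within the training length) is an assumption for illustration; the paper's FoPE additionally replaces each single sinusoid with a Fourier Series.

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE: one angular frequency per pair of embedding dimensions."""
    return base ** (-np.arange(0, dim, 2) / dim)

def zero_out_undertrained(freqs: np.ndarray, train_len: int) -> np.ndarray:
    """Illustrative rule: a component whose period exceeds the training context length
    never completes a full cycle during training and is treated as under-trained."""
    periods = 2 * np.pi / freqs
    return np.where(periods > train_len, 0.0, freqs)

freqs = rope_frequencies(dim=128)
clipped = zero_out_undertrained(freqs, train_len=2048)
print(f"{int(np.sum(clipped == 0))} of {len(freqs)} components zeroed")
```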
How to Synthesize Text Data without Model Collapse?
Zhu, Xuekai, Cheng, Daixuan, Li, Hengli, Zhang, Kaiyan, Hua, Ermo, Lv, Xingtai, Ding, Ning, Lin, Zhouhan, Zheng, Zilong, Zhou, Bowen
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-{n} models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and over-concentration of n-gram features. Inspired by these findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance. As generative artificial intelligence (AI) (Rombach et al., 2021; Achiam et al., 2023) becomes increasingly prevalent in research and industry, synthetic data will proliferate throughout the web data ecosystem. Consequently, future training of GPT-{n} on a mixture of synthetic and human-produced data will be inevitable. Thus, model collapse is a critical concern that must be considered when training models on synthetic data. Model collapse refers to a degenerative process in which the output data of learned generative models contaminates the training sets of subsequent generations. As shown in Figure 1, iterative training coupled with data synthesis induces a progressive accumulation of test errors (Shumailov et al., 2024; Dohmatob et al., 2024a). Consequently, generative models increasingly overfit to synthetic data distributions, failing to capture the complexity of human-produced data.
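As a hedged sketch of what token-level editing of human-produced data could look like, the snippet below resamples tokens on which a prior language model is over-confident. The threshold rule, the choice of `gpt2` as the prior, and the resampling step are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative prior model; any small causal LM works for the sketch.
MODEL_NAME = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def token_edit(text: str, tau: float = 0.99) -> str:
    """Resample tokens whose conditional probability under the prior exceeds tau."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits
    probs = torch.softmax(logits[0, :-1], dim=-1)               # p(token_t | tokens_<t)
    target_probs = probs.gather(-1, ids[0, 1:, None]).squeeze(-1)
    edited = ids.clone()
    for t, p in enumerate(target_probs):
        if p > tau:                                             # over-confident: edit this token
            edited[0, t + 1] = torch.multinomial(probs[t], 1)
    return tokenizer.decode(edited[0], skip_special_tokens=True)

print(token_edit("The quick brown fox jumps over the lazy dog."))
```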
UltraMedical: Building Specialized Generalists in Biomedicine
Zhang, Kaiyan, Zeng, Sihang, Hua, Ermo, Ding, Ning, Chen, Zhang-Ren, Ma, Zhiyuan, Li, Haoxin, Cui, Ganqu, Qi, Biqing, Zhu, Xuekai, Lv, Xingtai, Hu, Jinfang, Liu, Zhiyuan, Zhou, Bowen
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains and are moving towards more specialized areas. Recent advanced proprietary models such as GPT-4 and Gemini have achieved significant advancements in biomedicine, but they have also raised privacy and security challenges. The construction of specialized generalists hinges largely on high-quality datasets, enhanced by techniques such as supervised fine-tuning, reinforcement learning from human or AI feedback, and direct preference optimization. However, these leading techniques (e.g., preference learning) are still significantly limited in the open-source community due to the scarcity of specialized data. In this paper, we present the UltraMedical collections, which consist of high-quality manual and synthetic datasets in the biomedicine domain, featuring preference annotations across multiple advanced LLMs. By utilizing these datasets, we fine-tune a suite of specialized medical models based on the Llama-3 series, demonstrating compelling capabilities across various medical benchmarks. Moreover, we develop powerful reward models that perform well on both biomedical and general reward benchmarks, further enhancing online preference learning within the biomedical LLM community.
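The preference annotations mentioned above are what enable direct preference optimization; as a minimal sketch (not UltraMedical's training code), the standard DPO objective over chosen/rejected sequence log-probabilities looks like this.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO objective on sequence log-probabilities of chosen vs. rejected answers."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Toy tensors standing in for per-example sequence log-probs under policy and reference models.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))
```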
Critical Data Size of Language Models from a Grokking Perspective
Zhu, Xuekai, Fu, Yao, Zhou, Bowen, Lin, Zhouhan
We explore the critical data size of language models, a threshold that marks a fundamental shift from quick memorization to slow generalization. We formalize this phase transition under the grokking configuration as the Data Efficiency Hypothesis and identify data insufficiency, sufficiency, and surplus regimes in language model training dynamics. We develop a grokking configuration that stably reproduces grokking on simplistic language models by rescaling the initialization and weight decay. We show that generalization occurs only when the training data reaches a critical size. We analyze grokking at both the sample-wise and model-wise level, verifying the proposed Data Efficiency Hypothesis. Our experiments reveal smoother phase transitions occurring at the critical dataset size for language datasets. As the model size increases, this critical point also becomes larger, indicating that larger models require more data. Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanisms of language models.
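A hedged sketch of the two knobs the abstract names for the grokking configuration, rescaling the initialization and applying weight decay; the scale factor, learning rate, and weight-decay value are illustrative placeholders, not the paper's tuned settings.

```python
import torch
from torch import nn

def apply_grokking_config(model: nn.Module, init_scale: float = 8.0,
                          weight_decay: float = 1e-1, lr: float = 1e-3):
    """Rescale the standard initialization and attach an optimizer with strong weight decay."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(init_scale)          # rescale initialization (illustrative factor)
    return torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

# Stand-in module only; the paper studies simplistic language models.
toy_model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = apply_grokking_config(toy_model)
print(optimizer.defaults["weight_decay"])  # 0.1
```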
CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model
Zhang, Kaiyan, Ding, Ning, Qi, Biqing, Zhu, Xuekai, Long, Xinwei, Zhou, Bowen
Instruction tuning has recently been recognized as an effective way of aligning Large Language Models (LLMs) to enhance their generalization ability across various tasks. However, when tuning publicly accessible, centralized LLMs with private instruction data, privacy concerns are inevitable. While directly transferring parameterized modules between models is a plausible approach to address this, its implications and effectiveness need further exploration. This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators. Given the limited understanding of the underlying mechanism of OFT, we perform an empirical analysis of LLMs from the perspectives of representation and functional similarity. Interestingly, our findings reveal a unique modular structure within the layers of LLMs that appears to emerge as the model size expands. At the same time, we note subtle but potentially significant changes in representation and intermediate predictions across the layers. Inspired by these observations, we propose CRaSh, a training-free strategy involving Clustering, Removing, and Sharing to derive improved emulators from LLMs. CRaSh significantly boosts the performance of OFT on LLMs with billions of parameters. Furthermore, we investigate the optimal solutions yielded by fine-tuning with and without the full model through the lens of the loss landscape. Our findings demonstrate linear connectivity among these optima, which fall within the same basin, thereby highlighting the effectiveness of CRaSh and OFT. The source code is publicly available at https://github.com/TsinghuaC3I/CRaSh.
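The Clustering, Removing, and Sharing steps can be illustrated on a list of layer indices, as in the sketch below; the uniform grouping stands in for the paper's representation-similarity clustering, and keeping the first layer of each cluster is an assumption made only for illustration.

```python
from typing import List

def crash_emulator(num_layers: int, num_clusters: int) -> List[int]:
    """Return the layer indices an emulator would reuse after clustering, removing, and sharing."""
    size = num_layers // num_clusters
    clusters = [list(range(i * size, (i + 1) * size)) for i in range(num_clusters)]
    kept = [c[0] for c in clusters]                  # Removing: keep one representative per cluster
    shared = []
    for layer, cluster in zip(kept, clusters):       # Sharing: repeat it to restore the original depth
        shared.extend([layer] * len(cluster))
    return shared

print(crash_emulator(num_layers=32, num_clusters=8))
# e.g. [0, 0, 0, 0, 4, 4, 4, 4, ...] -> the emulator reuses one shared layer per cluster
```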
PaD: Program-aided Distillation Specializes Large Models in Reasoning
Zhu, Xuekai, Qi, Biqing, Zhang, Kaiyan, Long, Xinwei, Zhou, Bowen
While Large Language Models (LLMs) excel at several natural language processing tasks, their size and inaccessibility present challenges for extensive practical application. Previous studies acquire specialized skills by distilling from LLMs, trading away generic abilities in a process called model specialization. For reasoning ability, chain-of-thought data is synthesized from LLMs for subsequent distillation. However, due to hallucination, the synthetic chain-of-thought from LLMs contains faulty reasoning, and these incorrect reasoning steps damage the distilled models' reasoning capability. To tackle the above issues, we propose Program-aided Distillation (PaD), which distills LLMs to obtain small models specialized in reasoning tasks. In PaD, we strengthen specialized models with program-aided reasoning and help them overcome faulty reasoning steps with automated error checking. Experimental results demonstrate that, on the GSM8K benchmark, a 0.06B model using PaD can not only outperform certain LLMs (e.g., LLaMA) but also achieve a 10% improvement over baselines with a significantly smaller scale of parameters and data. A data pruning analysis reveals that PaD possesses higher training efficiency.
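The automated error checking described above can be pictured as executing each synthesized program and keeping it only if it reproduces the gold answer; the convention that the program stores its result in a variable named `answer` is an illustrative assumption.

```python
def check_program(program: str, gold_answer: float) -> bool:
    """Keep a synthesized reasoning program only if executing it reproduces the gold answer."""
    scope = {}
    try:
        exec(program, scope)             # run the program-of-thought
    except Exception:
        return False                     # crashing programs are discarded
    return abs(scope.get("answer", float("nan")) - gold_answer) < 1e-6

synthetic_sample = "eggs = 16\neaten = 3 + 4\nanswer = (eggs - eaten) * 2"
print(check_program(synthetic_sample, gold_answer=18))  # True: keep this sample for distillation
```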
StoryTrans: Non-Parallel Story Author-Style Transfer with Discourse Representations and Content Enhancing
Zhu, Xuekai, Guan, Jian, Huang, Minlie, Liu, Juan
Non-parallel text style transfer is an important task in natural language generation. However, previous studies concentrate on the token or sentence level, such as sentence sentiment and formality transfer, and neglect long-text style transfer at the discourse level. Long texts usually involve more complicated author linguistic preferences, such as discourse structures, than sentences do. In this paper, we formulate the task of non-parallel story author-style transfer, which requires transferring an input story into a specified author style while maintaining source semantics. To tackle this problem, we propose a generation model, named StoryTrans, which leverages discourse representations to capture source content information and transfer them to target styles with learnable style embeddings. We use an additional training objective to disentangle stylistic features from the learned discourse representation to prevent the model from degenerating into an auto-encoder. Moreover, to enhance content preservation, we design a mask-and-fill framework to explicitly fuse style-specific keywords of source texts into generation. Furthermore, we construct new datasets for this task in Chinese and English, respectively. Extensive experiments show that our model outperforms strong baselines in the overall performance of style transfer and content preservation.
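A toy sketch of the mask-and-fill idea: keywords from the source text are masked out and later filled back into the generation. The naive string substitution below only shows the interface; StoryTrans performs the fill with a trained generator and learnable style embeddings, and selects keywords differently.

```python
import re
from typing import List

def mask_keywords(text: str, keywords: List[str]) -> str:
    """Replace each source keyword with a placeholder mask token."""
    masked = text
    for kw in keywords:
        masked = re.sub(re.escape(kw), "<mask>", masked)
    return masked

def fill_keywords(draft: str, keywords: List[str]) -> str:
    """Fill masks back in source order to explicitly preserve content words."""
    for kw in keywords:
        draft = draft.replace("<mask>", kw, 1)
    return draft

source = "The old captain steered the ship through the storm."
keywords = ["captain", "ship", "storm"]
masked = mask_keywords(source, keywords)
print(masked)                            # "The old <mask> steered the <mask> through the <mask>."
print(fill_keywords(masked, keywords))   # recovers the source keywords in order
```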