Lv, Kai
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Lv, Kai, Guo, Honglin, Guo, Qipeng, Qiu, Xipeng
Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigating draft model overhead have primarily relied on heuristics and generally fail to match the quality of draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
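The draft-then-verify loop described in the abstract can be sketched as a toy greedy-acceptance routine. This is an illustrative sketch, not the DuoDecoding implementation: `target_step` and `draft_step` are hypothetical stand-ins for single-token greedy steps of the two models, and the CPU/GPU parallelism is omitted.

```python
def speculative_generate(target_step, draft_step, prefix, num_tokens, k=4):
    """Toy greedy speculative decoding: the draft model proposes k tokens,
    the target model verifies them, and the longest agreeing prefix is
    accepted. With greedy acceptance the output always matches what the
    target model alone would generate."""
    out = list(prefix)
    while len(out) - len(prefix) < num_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft, ctx = [], list(out)
        for _ in range(k):
            t = draft_step(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify each drafted token against the target model's choice.
        accepted, ctx = [], list(out)
        for t in draft:
            if target_step(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                break
        # 3. Keep the matched prefix; on a mismatch, fall back to one
        #    token from the target model so progress is always made.
        out.extend(accepted)
        if len(accepted) < k:
            out.append(target_step(out))
    return out[:len(prefix) + num_tokens]
```

Because verification is exact, even a poor draft model changes only the speed, never the output, of the greedy variant.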
CritiQ: Mining Data Quality Criteria from Human Preferences
Guo, Honglin, Lv, Kai, Guo, Qipeng, Liang, Tianyi, Xi, Zhiheng, Song, Demin, Zhang, Qiuyinzhe, Sun, Yu, Chen, Kai, Qiu, Xipeng, Gui, Tao
Language models heavily depend on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only $\sim$30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
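Two steps in the pipeline above lend themselves to a short sketch: aggregating worker agents' pairwise judgments by majority vote, and selecting a top fraction of the data by quality score. Both functions are illustrative assumptions, not the CritiQ code:

```python
from collections import Counter

def majority_verdict(worker_votes):
    """Aggregate worker-agent judgments on one document pair ('A' or 'B')
    by majority vote. Hypothetical helper, not the CritiQ API."""
    tally = Counter(worker_votes)
    return max(tally, key=tally.get)

def select_top_fraction(docs, scorer, fraction=0.3):
    """Rank documents by a quality score (e.g. from a trained scorer) and
    keep the top fraction, mirroring score-based data selection."""
    ranked = sorted(docs, key=scorer, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```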
FastMCTS: A Simple Sampling Strategy for Data Synthesis
Li, Peiji, Lv, Kai, Shao, Yunfan, Ma, Yichuan, Li, Linyang, Zheng, Xiaoqing, Qiu, Xipeng, Guo, Qipeng
High-quality synthetic multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30\% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9\% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data. Our code will be released soon.
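The tree-search idea can be illustrated with a minimal MCTS-style sampler. This is a sketch under simplifying assumptions (UCT selection, binary terminal reward), not the FastMCTS algorithm; `step_fn` and `is_correct` are hypothetical stand-ins for reasoning-step generation and answer checking:

```python
import math
import random

def mcts_style_sample(step_fn, is_correct, root, iterations=50, c=1.4):
    """Minimal MCTS-style sampling sketch: partial reasoning paths share a
    tree, UCT steers expansion toward promising branches, and step-level
    values are backpropagated from terminal correctness checks."""
    # Each node maps a path tuple to [visits, total_value, children].
    tree = {root: [0, 0.0, []]}

    def uct(parent, child):
        pv, cv = tree[parent][0], tree[child][0]
        if cv == 0:
            return float('inf')  # always explore unvisited children first
        return tree[child][1] / cv + c * math.sqrt(math.log(pv + 1) / cv)

    correct_paths = []
    for _ in range(iterations):
        node = root
        # Selection: descend via UCT while children exist.
        while tree[node][2]:
            node = max(tree[node][2], key=lambda ch: uct(node, ch))
        # Expansion: sample candidate next steps for this partial path.
        steps = step_fn(node)
        if steps:
            for s in steps:
                child = node + (s,)
                tree[child] = [0, 0.0, []]
                tree[node][2].append(child)
            node = random.choice(tree[node][2])
        # Evaluation and backpropagation of a step-level reward signal.
        reward = 1.0 if is_correct(node) else 0.0
        if reward and node not in correct_paths:
            correct_paths.append(node)
        path = node
        while True:
            tree[path][0] += 1
            tree[path][1] += reward
            if path == root:
                break
            path = path[:-1]
    return correct_paths
```

Unlike independent rejection sampling, sibling paths here share prefixes and visit statistics, which is the source of the efficiency the abstract describes.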
CoDe: Communication Delay-Tolerant Multi-Agent Collaboration via Dual Alignment of Intent and Timeliness
Song, Shoucheng, Lin, Youfang, Han, Sheng, Yao, Chang, Wu, Hao, Wang, Shuo, Lv, Kai
Communication has been widely employed to enhance multi-agent collaboration. Previous research has typically assumed delay-free communication, a strong assumption that is challenging to meet in practice. However, real-world agents suffer from channel delays, receiving messages sent at different time points, termed {\it{Asynchronous Communication}}, leading to cognitive biases and breakdowns in collaboration. This paper first defines two communication delay settings in MARL and emphasizes their harm to collaboration. To handle these delays, this paper proposes a novel framework, Communication Delay-tolerant Multi-Agent Collaboration (CoDe). First, CoDe learns an intent representation as messages through future action inference, reflecting the stable future behavioral trends of the agents. Then, CoDe devises a dual alignment mechanism of intent and timeliness to strengthen the fusion process of asynchronous messages. In this way, agents can extract the long-term intent of others, even from delayed messages, and selectively utilize the most recent messages that are relevant to their intent. Experimental results demonstrate that CoDe outperforms baseline algorithms in three MARL benchmarks without delay and exhibits robustness under fixed and time-varying delays.
InternLM2 Technical Report
Cai, Zheng, Cao, Maosong, Chen, Haojiong, Chen, Kai, Chen, Keyu, Chen, Xin, Chen, Xun, Chen, Zehui, Chen, Zhi, Chu, Pei, Dong, Xiaoyi, Duan, Haodong, Fan, Qi, Fei, Zhaoye, Gao, Yang, Ge, Jiaye, Gu, Chenya, Gu, Yuzhe, Gui, Tao, Guo, Aijia, Guo, Qipeng, He, Conghui, Hu, Yingfan, Huang, Ting, Jiang, Tao, Jiao, Penglong, Jin, Zhenjiang, Lei, Zhikai, Li, Jiaxing, Li, Jingwen, Li, Linyang, Li, Shuaibin, Li, Wei, Li, Yining, Liu, Hongwei, Liu, Jiangning, Hong, Jiawei, Liu, Kaiwen, Liu, Kuikun, Liu, Xiaoran, Lv, Chengqi, Lv, Haijun, Lv, Kai, Ma, Li, Ma, Runyuan, Ma, Zerun, Ning, Wenchang, Ouyang, Linke, Qiu, Jiantao, Qu, Yuan, Shang, Fukai, Shao, Yunfan, Song, Demin, Song, Zifan, Sui, Zhihao, Sun, Peng, Sun, Yu, Tang, Huanze, Wang, Bin, Wang, Guoteng, Wang, Jiaqi, Wang, Jiayu, Wang, Rui, Wang, Yudong, Wang, Ziyi, Wei, Xingjian, Weng, Qizhen, Wu, Fan, Xiong, Yingtong, Xu, Chao, Xu, Ruiliang, Yan, Hang, Yan, Yirong, Yang, Xiaogui, Ye, Haochen, Ying, Huaiyuan, Yu, Jia, Yu, Jing, Zang, Yuhang, Zhang, Chuyu, Zhang, Li, Zhang, Pan, Zhang, Peng, Zhang, Ruijie, Zhang, Shuo, Zhang, Songyang, Zhang, Wenjian, Zhang, Wenwei, Zhang, Xingcheng, Zhang, Xinyue, Zhao, Hui, Zhao, Qian, Zhao, Xiaomeng, Zhou, Fengzhe, Zhou, Zaida, Zhuo, Jingming, Zou, Yicheng, Qiu, Xipeng, Qiao, Yu, Lin, Dahua
The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI). However, replicating such advancements in open-source models has been challenging. This paper introduces InternLM2, an open-source LLM that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks, long-context modeling, and open-ended subjective evaluations through innovative pre-training and optimization techniques. The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types including text, code, and long-context data. InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages, exhibiting remarkable performance on the 200k "Needle-in-a-Haystack" test. InternLM2 is further aligned using Supervised Fine-Tuning (SFT) and a novel Conditional Online Reinforcement Learning from Human Feedback (COOL RLHF) strategy that addresses conflicting human preferences and reward hacking. By releasing InternLM2 models in different training stages and model sizes, we provide the community with insights into the model's evolution.
LongWanjuan: Towards Systematic Measurement for Long Text Quality
Lv, Kai, Liu, Xiaoran, Guo, Qipeng, Yan, Hang, He, Conghui, Qiu, Xipeng, Lin, Dahua
The quality of training data is crucial for enhancing the long-text capabilities of foundation models. Despite existing efforts to refine data quality through heuristic rules and evaluations based on data diversity and difficulty, there's a lack of systematic approaches specifically tailored for assessing long texts. Addressing this gap, our work systematically measures the quality of long texts by evaluating three fundamental linguistic dimensions: coherence, cohesion, and complexity. Drawing inspiration from these three dimensions, we introduce a suite of metrics designed to evaluate the quality of long texts, encompassing both statistical and pre-trained language model-based ones. Leveraging these metrics, we present LongWanjuan, a bilingual dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens. In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality. Furthermore, we devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks. The code and dataset are available at https://github.com/OpenLMLab/LongWanjuan.
CoLLiE: Collaborative Training of Large Language Models in an Efficient Way
Lv, Kai, Zhang, Shuo, Gu, Tianle, Xing, Shuhao, Hong, Jiawei, Chen, Keyu, Liu, Xiaoran, Yang, Yuqing, Guo, Honglin, Liu, Tengxiao, Sun, Yu, Guo, Qipeng, Yan, Hang, Qiu, Xipeng
Large language models (LLMs) are increasingly pivotal in a wide range of natural language processing tasks. Access to pre-trained models, courtesy of the open-source community, has made it possible to adapt these models to specific applications for enhanced performance. However, the substantial resources required for training these models necessitate efficient solutions. This paper introduces CoLLiE, an efficient library that facilitates collaborative training of large language models using 3D parallelism, parameter-efficient fine-tuning (PEFT) methods, and optimizers such as Lion, Adan, Sophia, LOMO and AdaLomo. With its modular design and comprehensive functionality, CoLLiE offers a balanced blend of efficiency, ease of use, and customization. CoLLiE has proven superior training efficiency in comparison with prevalent solutions in pre-training and fine-tuning scenarios. Furthermore, we provide an empirical evaluation of the correlation between model size and GPU memory consumption under different optimization methods, as well as an analysis of the throughput. Lastly, we carry out a comprehensive comparison of various optimizers and PEFT methods within the instruction-tuning context. CoLLiE is available at https://github.com/OpenLMLab/collie.
AdaLomo: Low-memory Optimization with Adaptive Learning Rate
Lv, Kai, Yan, Hang, Guo, Qipeng, Lv, Haijun, Qiu, Xipeng
Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. Through empirical analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation in the optimizer state. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
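The core update can be sketched as follows, assuming a 2-D parameter whose second moment is kept as a rank-1 row/column factorization (in the style of Adafactor) plus a grouped update normalization. This is an illustrative sketch, not the released AdaLomo optimizer, and all names are hypothetical:

```python
import numpy as np

def adalomo_style_update(param, grad, row_r, col_c,
                         lr=1e-3, beta2=0.99, eps=1e-8):
    """One fused update in the spirit of AdaLomo: the second-moment state
    for an (m, n) parameter is two vectors of length m and n, so memory is
    O(m + n) instead of O(m * n), while each entry still gets an adaptive
    learning rate."""
    g2 = grad * grad
    # Factored second-moment statistics: moving averages of the row-wise
    # and column-wise means of grad**2.
    row_r *= beta2
    row_r += (1 - beta2) * g2.mean(axis=1)
    col_c *= beta2
    col_c += (1 - beta2) * g2.mean(axis=0)
    # Rank-1 reconstruction of the full second-moment estimate.
    v = np.outer(row_r, col_c) / max(row_r.mean(), eps)
    update = grad / (np.sqrt(v) + eps)
    # Grouped update normalization: clip the RMS of the whole update
    # to stabilize convergence.
    update /= max(1.0, np.linalg.norm(update) / np.sqrt(update.size))
    param -= lr * update
    return param, row_r, col_c
```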
Full Parameter Fine-tuning for Large Language Models with Limited Resources
Lv, Kai, Yang, Yuqing, Liu, Tengxiao, Gao, Qinghui, Guo, Qipeng, Qiu, Xipeng
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLM training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory-saving techniques, we reduce memory usage to 10.8% of the standard approach (the DeepSpeed solution). Consequently, our approach enables the full-parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090 GPUs, each with 24GB of memory.
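The fusion idea can be illustrated on a toy two-layer linear network with manual backprop: each weight matrix is updated the moment its gradient is produced and the gradient buffer is released immediately, so the full set of gradients never coexists in memory. This is a sketch of the principle, not the LOMO library:

```python
import numpy as np

def lomo_style_step(W1, W2, x, y, lr=0.1):
    """Toy fused backward-and-update for a 2-layer linear net with squared
    loss 0.5 * ||x @ W1 @ W2 - y||^2. At most one layer's gradient is alive
    at any point during the backward pass."""
    # Forward pass (activations are kept, as in standard backprop).
    h = x @ W1
    pred = h @ W2
    err = pred - y                 # dL/dpred
    # Backward through layer 2: compute grad, update in place, free it.
    gW2 = np.outer(h, err)
    dh = err @ W2.T                # upstream grad, taken before W2 changes
    W2 -= lr * gW2
    del gW2                        # fused: gradient freed right away
    # Backward through layer 1: same fuse pattern.
    gW1 = np.outer(x, dh)
    W1 -= lr * gW1
    del gW1
    return W1, W2
```

In the real setting this fusion is done inside the backward pass (e.g. via per-parameter gradient hooks), which is what removes the need to store all gradients at once.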
Unified Demonstration Retriever for In-Context Learning
Li, Xiaonan, Lv, Kai, Yan, Hang, Lin, Tianyang, Zhu, Wei, Ni, Yuan, Xie, Guotong, Wang, Xiaoling, Qiu, Xipeng
In-context learning is a new learning paradigm where a language model conditions on a few input-output pairs (demonstrations) and a test input, and directly outputs the prediction. It has been shown to be highly dependent on the provided demonstrations, which has promoted research on demonstration retrieval: given a test input, relevant examples are retrieved from the training set to serve as informative demonstrations for in-context learning. While previous works focus on training task-specific retrievers for several tasks separately, these methods are often hard to transfer and scale across tasks, and separately trained retrievers incur substantial parameter storage and deployment costs. In this paper, we propose Unified Demonstration Retriever (\textbf{UDR}), a single model to retrieve demonstrations for a wide range of tasks. To train UDR, we cast various tasks' training signals into a unified list-wise ranking formulation using the language model's feedback. We then propose a multi-task list-wise ranking training framework, with an iterative mining strategy to find high-quality candidates, which helps UDR fully incorporate various tasks' signals. Experiments on 30+ tasks across 13 task families and multiple data domains show that UDR significantly outperforms baselines. Further analyses show the effectiveness of each proposed component and UDR's strong performance in various scenarios, including different LMs (1.3B - 175B), unseen datasets, and varying demonstration quantities.