Ren, Siyu
Boosting LLM via Learning from Data Iteratively and Selectively
Jia, Qi, Ren, Siyu, Qin, Ziheng, Xue, Fuzhao, Ni, Jinjie, You, Yang
Datasets nowadays are generally constructed from multiple sources and using different synthetic techniques, making data de-noising and de-duplication crucial before being used for post-training. In this work, we propose to perform instruction tuning by iterative data selection (\ApproachName{}). We measure the quality of a sample from complexity and diversity simultaneously. Instead of calculating the complexity score once for all before fine-tuning, we highlight the importance of updating this model-specific score during fine-tuning to accurately accommodate the dynamic changes of the model. On the other hand, the diversity score is defined on top of the samples' responses under the consideration of their informativeness. IterIT integrates the strengths of both worlds by iteratively updating the complexity score for the top-ranked samples and greedily selecting the ones with the highest complexity-diversity score. Experiments on multiple instruction-tuning data demonstrate consistent improvements of IterIT over strong baselines. Moreover, our approach also generalizes well to domain-specific scenarios and different backbone models. All resources will be available at https://github.com/JiaQiSJTU/IterIT.
On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference
Ren, Siyu, Zhu, Kenny Q.
Despite the recent success associated with Large Language Models~(LLMs), they are notably cost-prohibitive to deploy in resource-constrained environments due to their excessive memory and computational demands. In addition to model parameters, the key-value cache is also stored in GPU memory, growing linearly with batch size and sequence length. As a remedy, recent works have proposed various eviction policies for maintaining the overhead of key-value cache under a given budget. This paper embarks on the efficacy of existing eviction policies in terms of \textit{importance score calculation} and \textit{eviction scope construction}. We identify the deficiency of prior policies in these two aspects and introduce RoCo, a \underline{r}\underline{o}bust \underline{c}ache \underline{o}mission policy based on temporal attention scores and robustness measures. Extensive experimentation spanning prefilling and auto-regressive decoding stages validates the superiority of RoCo. Finally, we release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference. Code available at \url{https://github.com/DRSY/EasyKV}.
EMO: Earth Mover Distance Optimization for Auto-Regressive Language Modeling
Ren, Siyu, Wu, Zhiyong, Zhu, Kenny Q.
Neural language models are probabilistic models of human text. They are predominantly trained using maximum likelihood estimation (MLE), which is equivalent to minimizing the forward cross-entropy between the empirical data distribution and the model distribution. However, various degeneration phenomena are still widely observed when decoding from the distributions learned by such models. We establish that the forward cross-entropy is suboptimal as a distance metric for aligning human and model distribution due to its (1) recall-prioritization (2) negative diversity ignorance and (3) train-test mismatch. In this paper, we propose Earth Mover Distance Optimization (EMO) for auto-regressive language modeling. EMO capitalizes on the inherent properties of earth mover distance to address the aforementioned challenges. Due to the high complexity of direct computation, we further introduce a feasible upper bound for EMO to ease end-to-end training. Upon extensive evaluation of language models trained using EMO and MLE. We find that EMO demonstrates a consistently better language modeling performance than MLE across domains. Moreover, EMO demonstrates noteworthy enhancements in downstream performance with minimal fine-tuning on merely 25,000 sentences. This highlights the tremendous potential of EMO as a lightweight calibration method for enhancing large-scale pre-trained language models.
Zero-shot Faithfulness Evaluation for Text Summarization with Foundation Language Model
Jia, Qi, Ren, Siyu, Liu, Yizhu, Zhu, Kenny Q.
Despite tremendous improvements in natural language generation, summarization models still suffer from the unfaithfulness issue. Previous work evaluates faithfulness either using models trained on the other tasks or in-domain synthetic data, or prompting a large model such as ChatGPT. This paper proposes to do zero-shot faithfulness evaluation simply with a moderately-sized foundation language model. We introduce a new metric FFLM, which is a combination of probability changes based on the intuition that prefixing a piece of text that is consistent with the output will increase the probability of predicting the output. Experiments show that FFLM performs competitively with or even outperforms ChatGPT on both inconsistency detection and faithfulness rating with 24x fewer parameters. FFLM also achieves improvements over other strong baselines.
Multi-turn Response Selection using Dialogue Dependency Relations
Jia, Qi, Liu, Yizhu, Ren, Siyu, Zhu, Kenny Q., Tang, Haifeng
Multi-turn response selection is a task designed for developing dialogue agents. The performance on this task has a remarkable improvement with pre-trained language models. However, these models simply concatenate the turns in dialogue history as the input and largely ignore the dependencies between the turns. In this paper, we propose a dialogue extraction algorithm to transform a dialogue history into threads based on their dependency relations. Each thread can be regarded as a self-contained sub-dialogue. We also propose Thread-Encoder model to encode threads and candidates into compact representations by pre-trained Transformers and finally get the matching score through an attention layer. The experiments show that dependency relations are helpful for dialogue context understanding, and our model outperforms the state-of-the-art baselines on both DSTC7 and DSTC8*, with competitive results on UbuntuV2.
Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models
Xu, Fangzhi, Wu, Zhiyong, Sun, Qiushi, Ren, Siyu, Yuan, Fei, Yuan, Shuai, Lin, Qika, Qiao, Yu, Liu, Jun
Large Language Models (LLMs) have greatly propelled the progress in natural language(NL)-centric tasks based on NL interface. However, the NL form is not enough for world knowledge. Current works focus on this question by injecting specific symbolic knowledge into LLM, which ignore two critical challenges: the interrelations between various symbols and the balance between symbolic-centric and NL-centric capabilities. In this work, we tackle these challenges from both a data and framework perspective and introduce Symbol-LLM series models. First, we collect 34 symbolic tasks, covering ~20 different forms, which are unified to capture symbol interrelations. Then, a two-stage tuning framework succeeds in injecting symbolic knowledge without loss of the generality ability. Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performances of Symbol-LLM series models.
Context Compression for Auto-regressive Transformers with Sentinel Tokens
Ren, Siyu, Jia, Qi, Zhu, Kenny Q.
The quadratic complexity of the attention module makes it gradually become the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when dealing with long inputs also brings severe issues on memory footprint and inference latency. In this work, we propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. At last, we comprehensively profile the benefit of context compression on improving the system throughout. Code is available at https://github.com/DRSY/KV_Compression.
Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches and Future Directions
Jia, Qi, Liu, Yizhu, Ren, Siyu, Zhu, Kenny Q.
Abstractive dialogue summarization generates a concise and fluent summary covering the salient information in a dialogue among two or more interlocutors. It has attracted significant attention in recent years based on the massive emergence of social communication platforms and an urgent requirement for efficient dialogue information understanding and digestion. Different from news or articles in traditional document summarization, dialogues bring unique characteristics and additional challenges, including different language styles and formats, scattered information, flexible discourse structures, and unclear topic boundaries. This survey provides a comprehensive investigation of existing work for abstractive dialogue summarization from scenarios, approaches to evaluations. It categorizes the task into two broad categories according to the type of input dialogues, i.e., open-domain and task-oriented, and presents a taxonomy of existing techniques in three directions, namely, injecting dialogue features, designing auxiliary training tasks and using additional data. A list of datasets under different scenarios and widely-accepted evaluation metrics are summarized for completeness. After that, the trends of scenarios and techniques are summarized, together with deep insights into correlations between extensively exploited features and different scenarios. Based on these analyses, we recommend future directions, including more controlled and complicated scenarios, technical innovations and comparisons, publicly available datasets in special domains, etc. CCS Concepts: Computing methodologies Natural language generation; Discourse, dialogue and pragmatics; General and reference Surveys and overviews.
Low-Rank Prune-And-Factorize for Language Model Compression
Ren, Siyu, Zhu, Kenny Q.
The components underpinning PLMs -- large weight matrices -- were shown to bear considerable redundancy. Matrix factorization, a well-established technique from matrix theory, has been utilized to reduce the number of parameters in PLM. However, it fails to retain satisfactory performance under moderate to high compression rate. In this paper, we identify the \textit{full-rankness} of fine-tuned PLM as the fundamental bottleneck for the failure of matrix factorization and explore the use of network pruning to extract low-rank sparsity pattern desirable to matrix factorization. We find such low-rank sparsity pattern exclusively exists in models generated by first-order pruning, which motivates us to unite the two approaches and achieve more effective model compression. We further propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning, which improve the initialization and training of the compression procedure, respectively. Experiments on GLUE and question-answering tasks show that the proposed method has superior compression-performance trade-off compared to existing approaches.
Pruning Pre-trained Language Models with Principled Importance and Self-regularization
Ren, Siyu, Zhu, Kenny Q.
Iterative pruning is one of the most effective compression methods for pre-trained language models. We discovered that finding the optimal pruning decision is an equality-constrained 0-1 Integer Linear Programming problem. The solution to this optimization problem leads to a principled importance criterion which we use to rank parameters during iterative model pruning. To mitigate the poor generalization at high sparsity levels, we propose a self-regularization scheme where model prediction is regularized by the latest checkpoint with increasing sparsity throughout pruning. Our experiments on natural language understanding, question-answering, named entity recognition, and data-to-text generation with various Transformer-based PLMs show the effectiveness of the approach at various sparsity levels.