Plotting

 Li, Chengming


Layer-wise Regularized Dropout for Neural Language Models

arXiv.org Artificial Intelligence

Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based Language models. Specifically, LR-Drop layer-wise regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a "self-distillation" framework, in which each sub-model generated by dropout is the other's "teacher" model and "student" model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performances, including state-of-the-art results.


Forgetting before Learning: Utilizing Parametric Arithmetic for Knowledge Updating in Large Language Models

arXiv.org Artificial Intelligence

Recent advancements in Large Language Models (LLMs) have showcased their remarkable capabilities in text understanding and generation. However, even stronger LLMs are susceptible to acquiring erroneous or obsolete information from the training corpus. Direct secondary fine-tuning with data containing new knowledge may be ineffective in updating knowledge due to the conflict between old and new knowledge. In this paper, we propose a new paradigm for fine-tuning called F-Learning (Forgetting before Learning), which employs parametric arithmetic to facilitate the forgetting of old knowledge and learning of new knowledge. Experimental results on two publicly available datasets demonstrate that our proposed F-Learning can obviously improve the knowledge updating performance of both full fine-tuning and LoRA fine-tuning, simultaneously outperforming the existing baselines in most cases. Moreover, we have also discovered that forgetting old knowledge by subtracting the parameters of LoRA can yield a similar effect to subtracting the parameters of full fine-tuning, and occasionally even surpass it significantly.


History, Development, and Principles of Large Language Models-An Introductory Survey

arXiv.org Artificial Intelligence

Language models serve as a cornerstone in natural language processing (NLP), utilizing mathematical methods to generalize language laws and knowledge for prediction and generation. Over extensive research spanning decades, language modeling has progressed from initial statistical language models (SLMs) to the contemporary landscape of large language models (LLMs). Notably, the swift evolution of LLMs has reached the ability to process, understand, and generate human-level text. Nevertheless, despite the significant advantages that LLMs offer in improving both work and personal lives, the limited understanding among general practitioners about the background and principles of these models hampers their full potential. Notably, most LLMs reviews focus on specific aspects and utilize specialized language, posing a challenge for practitioners lacking relevant background knowledge. In light of this, this survey aims to present a comprehensible overview of LLMs to assist a broader audience. It strives to facilitate a comprehensive understanding by exploring the historical background of language models and tracing their evolution over time. The survey further investigates the factors influencing the development of LLMs, emphasizing key contributions. Additionally, it concentrates on elucidating the underlying principles of LLMs, equipping audiences with essential theoretical knowledge. The survey also highlights the limitations of existing work and points out promising future directions.


E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models

arXiv.org Artificial Intelligence

With the accelerating development of Large Language Models (LLMs), many LLMs are beginning to be used in the Chinese K-12 education domain. The integration of LLMs and education is getting closer and closer, however, there is currently no benchmark for evaluating LLMs that focuses on the Chinese K-12 education domain. Therefore, there is an urgent need for a comprehensive natural language processing benchmark to accurately assess the capabilities of various LLMs in the Chinese K-12 education domain. To address this, we introduce the E-EVAL, the first comprehensive evaluation benchmark specifically designed for the Chinese K-12 education field. The E-EVAL consists of 4,351 multiple-choice questions at the primary, middle, and high school levels across a wide range of subjects, including Chinese, English, Politics, History, Ethics, Physics, Chemistry, Mathematics, and Geography. We conducted a comprehensive evaluation of E-EVAL on advanced LLMs, including both English-dominant and Chinese-dominant models. Findings show that Chinese-dominant models perform well compared to English-dominant models, with many scoring even above the GPT 4.0. However, almost all models perform poorly in complex subjects such as mathematics. We also found that most Chinese-dominant LLMs did not achieve higher scores at the primary school level compared to the middle school level. We observe that the mastery of higher-order knowledge by the model does not necessarily imply the mastery of lower-order knowledge as well. Additionally, the experimental results indicate that the Chain of Thought (CoT) technique is effective only for the challenging science subjects, while Few-shot prompting is more beneficial for liberal arts subjects. With E-EVAL, we aim to analyze the strengths and limitations of LLMs in educational applications, and to contribute to the progress and development of Chinese K-12 education and LLMs.


Expression Syntax Information Bottleneck for Math Word Problems

arXiv.org Artificial Intelligence

Math Word Problems (MWP) aims to automatically solve mathematical questions given in texts. Previous studies tend to design complex models to capture additional information in the original text so as to enable the model to gain more comprehensive features. In this paper, we turn our attention in the opposite direction, and work on how to discard redundant features containing spurious correlations for MWP. To this end, we design an Expression Syntax Information Bottleneck method for MWP (called ESIB) based on variational information bottleneck, which extracts essential features of expression syntax tree while filtering latent-specific redundancy containing syntax-irrelevant features. The key idea of ESIB is to encourage multiple models to predict the same expression syntax tree for different problem representations of the same problem by mutual learning so as to capture consistent information of expression syntax tree and discard latent-specific redundancy. To improve the generalization ability of the model and generate more diverse expressions, we design a self-distillation loss to encourage the model to rely more on the expression syntax information in the latent space. Experimental results on two large-scale benchmarks show that our model not only achieves state-of-the-art results but also generates more diverse solutions. The code is available.


DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning

arXiv.org Artificial Intelligence

Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effective selection of exemplars for facilitating in-context learning. In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe) to automatically select exemplars for in-context learning. Dual Queries first query LLM to obtain LLM-generated knowledge such as CoT, then query the retriever to obtain the final exemplars via both question and the knowledge. Moreover, for the second query, LoRe employs dimensionality reduction techniques to refine exemplar selection, ensuring close alignment with the input question's knowledge. Through extensive experiments, we demonstrate that DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%. Our comprehensive analysis further reveals that DQ-LoRe consistently outperforms retrieval-based approaches in terms of both performance and adaptability, especially in scenarios characterized by distribution shifts. DQ-LoRe pushes the boundaries of in-context learning and opens up new avenues for addressing complex reasoning challenges. We will release the code soon.


Self-Distillation with Meta Learning for Knowledge Graph Completion

arXiv.org Artificial Intelligence

In this paper, we propose a selfdistillation framework with meta learning(MetaSD) for knowledge graph completion with dynamic pruning, which aims to learn compressed graph embeddings and tackle the longtail samples. Specifically, we first propose a dynamic pruning technique to obtain a small pruned model from a large source model, where the pruning mask of the pruned model could be updated adaptively per epoch after the model weights are updated. The pruned model is supposed to be more sensitive to difficult to memorize samples(e.g., longtail samples) than the source model. Then, we propose a onestep meta selfdistillation method for distilling comprehensive knowledge from the source model to the pruned model, where the two models coevolve in a dynamic manner during training. In particular, we exploit the performance of the pruned model, which is trained alongside the source model in one iteration, to improve the source models knowledge transfer ability for the next iteration via meta learning. Extensive experiments show that MetaSD achieves competitive performance compared to strong baselines, while being 10x smaller than baselines.


Scene-adaptive Knowledge Distillation for Sequential Recommendation via Differentiable Architecture Search

arXiv.org Artificial Intelligence

Sequential recommender systems (SRS) have become a research hotspot due to its power in modeling user dynamic interests and sequential behavioral patterns. To maximize model expressive ability, a default choice is to apply a larger and deeper network architecture, which, however, often brings high network latency when generating online recommendations. Naturally, we argue that compressing the heavy recommendation models into middle- or light- weight neural networks is of great importance for practical production systems. To realize such a goal, we propose AdaRec, a knowledge distillation (KD) framework which compresses knowledge of a teacher model into a student model adaptively according to its recommendation scene by using differentiable Neural Architecture Search (NAS). Specifically, we introduce a target-oriented distillation loss to guide the structure search process for finding the student network architecture, and a cost-sensitive loss as constraints for model size, which achieves a superior trade-off between recommendation effectiveness and efficiency. In addition, we leverage Earth Mover's Distance (EMD) to realize many-to-many layer mapping during knowledge distillation, which enables each intermediate student layer to learn from other intermediate teacher layers adaptively. Extensive experiments on real-world recommendation datasets demonstrate that our model achieves competitive or better accuracy with notable inference speedup comparing to strong counterparts, while discovering diverse neural architectures for sequential recommender models under different recommendation scenes.