Yu, Mengxia
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
Nguyen, Bang, Du, Tingting, Yu, Mengxia, Angrave, Lawrence, Jiang, Meng
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Models for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
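To make the item-analysis dimensions above concrete, the sketch below computes the classical statistics they refer to (item difficulty as the proportion of correct responses, item discrimination as a point-biserial correlation with total score, and distractor efficiency as the share of distractors that attract responses). It is a generic illustration over toy simulated responses, not the QG-SMS implementation.

```python
# Minimal illustration of classical test item analysis statistics.
# Generic formulas over toy data; not the QG-SMS implementation.
from statistics import mean, pstdev

def item_difficulty(correct_flags):
    """Proportion of students answering the item correctly (the p-value)."""
    return mean(correct_flags)

def item_discrimination(correct_flags, total_scores):
    """Point-biserial correlation between item correctness and total score."""
    n = len(correct_flags)
    sigma = pstdev(total_scores)
    right = [s for f, s in zip(correct_flags, total_scores) if f]
    wrong = [s for f, s in zip(correct_flags, total_scores) if not f]
    if not right or not wrong or sigma == 0:
        return 0.0
    p = len(right) / n
    return (mean(right) - mean(wrong)) / sigma * (p * (1 - p)) ** 0.5

def distractor_efficiency(choice_counts, answer_key, threshold=0.05):
    """Share of distractors chosen by at least `threshold` of examinees."""
    total = sum(choice_counts.values())
    distractors = [c for c in choice_counts if c != answer_key]
    functional = [c for c in distractors if choice_counts[c] / total >= threshold]
    return len(functional) / len(distractors)

# Toy example: 6 simulated students; the item is answered correctly by 4.
correct = [1, 1, 0, 1, 0, 1]
totals = [18, 15, 9, 16, 7, 14]
choices = {"A": 4, "B": 1, "C": 1, "D": 0}  # "A" is the answer key
print(item_difficulty(correct))             # 0.67
print(item_discrimination(correct, totals)) # high positive discrimination
print(distractor_efficiency(choices, "A"))  # 2 of 3 distractors functional
```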
The Super Weight in Large Language Models
Yu, Mengxia, Wang, De, Shan, Qi, Reed, Colorado, Wan, Alvin
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text - increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered.

Large Language Models (LLMs) have been growing in size and capability at an unprecedented rate, enabling them to capture increasingly complex linguistic patterns across a wide range of tasks. However, with this increase in model scale, new and unexpected behaviors have emerged. Dettmers et al. (2022) discovered that once LLMs reach a certain scale, a small set of hidden state features contains outliers of exceptionally large magnitude.
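The sketch below shows one way a single-forward-pass inspection could be set up with an open Llama-style checkpoint in Hugging Face transformers: a forward hook records the largest-magnitude activation entering and leaving each MLP down-projection, and layers with extreme values are candidates for hosting a super weight. The model name, layer path, and ranking heuristic are illustrative assumptions, not the authors' released method.

```python
# Sketch: record unusually large activations at each MLP down-projection
# during one forward pass. Illustrative only; assumes a Llama-style model
# in Hugging Face transformers, not the paper's released implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed stand-in checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

records = []  # (layer index, max |down_proj input|, max |down_proj output|)

def make_hook(layer_idx):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        records.append((layer_idx,
                        x.abs().max().item(),
                        output.detach().abs().max().item()))
    return hook

handles = [layer.mlp.down_proj.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

# Layers whose activation magnitudes dwarf the rest are candidates for
# hosting a super weight; zeroing that single coordinate and re-measuring
# perplexity would probe the sensitivity described in the abstract.
for layer_idx, max_in, max_out in sorted(records, key=lambda r: -r[2])[:5]:
    print(f"layer {layer_idx}: max|down_proj in|={max_in:.1f}, max|out|={max_out:.1f}")
```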
Reference-based Metrics Disprove Themselves in Question Generation
Nguyen, Bang, Yu, Mengxia, Huang, Yun, Jiang, Meng
Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collected another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntax or semantics of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
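As an illustration of the reference-free idea, the sketch below scores a generated question with a rubric over the three criteria named above. The prompt wording and the `judge` placeholder are assumptions for illustration, not the paper's metric.

```python
# Sketch of a reference-free, rubric-based QG score along the dimensions
# named above (naturalness, answerability, complexity). The prompt wording
# and the `judge` placeholder are illustrative, not the paper's metric.
import json

RUBRIC = """You are grading a machine-generated question about a passage.
Rate each criterion from 1 (poor) to 5 (excellent) and reply as JSON:
- naturalness: is the question fluent and human-like?
- answerability: can it be answered from the passage alone?
- complexity: does answering require non-trivial reasoning over the passage?

Passage: {context}
Question: {question}
JSON:"""

def judge(prompt: str) -> str:
    """Placeholder: call any chat LLM here and return its text response."""
    raise NotImplementedError

def score_question(context: str, question: str) -> float:
    reply = judge(RUBRIC.format(context=context, question=question))
    ratings = json.loads(reply)
    # Average the per-dimension ratings; no reference question is needed.
    return sum(ratings[k] for k in ("naturalness", "answerability", "complexity")) / 3
```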
Pre-training Language Models for Comparative Reasoning
Yu, Mengxia, Zhang, Zhihan, Yu, Wenhao, Jiang, Meng
Comparative reasoning is the process of comparing objects, concepts, or entities to draw conclusions, and it constitutes a fundamental cognitive ability. In this paper, we propose a novel framework to pre-train language models to enhance their comparative reasoning abilities over texts. While there have been approaches for NLP tasks that require comparative reasoning, they suffer from costly manual data labeling and limited generalizability to different tasks. Our approach introduces a novel method of collecting scalable data for text-based entity comparison, which leverages both structured and unstructured data. Moreover, we present a framework for pre-training language models via three novel objectives on comparative reasoning. Evaluation on downstream tasks including comparative question answering, question generation, and summarization shows that our pre-training framework significantly improves the comparative reasoning abilities of language models, especially under low-resource conditions. This work also releases the first integrated benchmark for comparative reasoning.
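As a toy illustration of text-based entity comparison drawn from structured data, the sketch below turns two attribute tables into simple comparative statements. The entities, attributes, and templates are invented for illustration; this is not the paper's data-collection pipeline or its pre-training objectives.

```python
# Toy sketch: turning structured attribute tables into textual comparison
# instances. Purely illustrative of "text-based entity comparison" data;
# not the paper's data-collection pipeline or training objectives.
ENTITIES = {
    "Python": {"paradigm": "multi-paradigm", "typing": "dynamic", "first_release": 1991},
    "Rust":   {"paradigm": "multi-paradigm", "typing": "static",  "first_release": 2015},
}

def compare(a: str, b: str) -> list[str]:
    """Build simple comparative statements over shared attributes of two entities."""
    statements = []
    shared = ENTITIES[a].keys() & ENTITIES[b].keys()
    for attr in sorted(shared):
        va, vb = ENTITIES[a][attr], ENTITIES[b][attr]
        if va == vb:
            statements.append(f"Both {a} and {b} have {attr} = {va}.")
        else:
            statements.append(f"{a} and {b} differ in {attr}: {va} vs. {vb}.")
    return statements

for s in compare("Python", "Rust"):
    print(s)
```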
A Survey of Multi-task Learning in Natural Language Processing: Regarding Task Relatedness and Training Methods
Zhang, Zhihan, Yu, Wenhao, Yu, Mengxia, Guo, Zhichun, Jiang, Meng
By focusing on one such task, the model ignores knowledge from the training signals of related tasks (Ruder, 2017). There are a great number of tasks in NLP, from syntax parsing to information extraction, from machine translation to question answering: each requires a model dedicated to learning from data. Biologically, humans learn natural languages, from basic grammar to complex semantics, in a single brain (Hashimoto et al., 2017). In the field of machine learning, multi-task learning (MTL) aims to leverage useful information shared across multiple related tasks to improve the generalization performance on all tasks (Caruana, 1997). In deep neural networks, it is generally achieved by sharing part of the network across tasks.

An earlier survey divided the two "how to share" categories into five categories, including the feature learning, low-rank, task clustering, task relation learning, and decomposition approaches; Crawshaw (2020) presented more recent models in both single-domain and multi-modal architectures, as well as an overview of optimization methods in MTL. Nevertheless, it is still not clearly understood how to design and train a single model to handle a variety of NLP tasks according to task relatedness. Especially when faced with a set of tasks that have seldom been trained together before, it is of crucial importance that researchers find proper auxiliary tasks and assess the feasibility of such a multi-task learning attempt.
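Since the excerpt notes that MTL in deep networks is generally achieved by sharing part of the network, the sketch below shows the standard hard-parameter-sharing pattern (a shared encoder with task-specific heads). It is a generic illustration, not an architecture taken from the survey.

```python
# Minimal hard-parameter-sharing setup: one shared encoder, one head per
# task. A generic illustration of "sharing part of the network" in MTL,
# not an architecture taken from the survey.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=10000, hidden=128, task_num_labels=(2, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)             # shared
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)   # shared
        self.heads = nn.ModuleList(                               # task-specific
            [nn.Linear(hidden, n) for n in task_num_labels]
        )

    def forward(self, token_ids, task_id):
        x = self.embed(token_ids)
        _, h = self.encoder(x)           # h: (1, batch, hidden)
        return self.heads[task_id](h.squeeze(0))

model = MultiTaskModel()
batch = torch.randint(0, 10000, (4, 16))   # 4 sentences, 16 tokens each
logits_task0 = model(batch, task_id=0)     # shape (4, 2)
logits_task1 = model(batch, task_id=1)     # shape (4, 5)
```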
Validating Label Consistency in NER Data Annotation
Zeng, Qingkai, Yu, Mengxia, Yu, Wenhao, Jiang, Tianwen, Weninger, Tim, Jiang, Meng
Data annotation plays a crucial role in ensuring that named entity recognition (NER) models are trained on correctly labeled data. Producing accurate labels is challenging due to the complexity involved in annotation. Label inconsistency between multiple subsets of data annotation (e.g., the training set and test set, or multiple training subsets) is an indicator of label mistakes. In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance. It can be used to validate the label consistency (or catch the inconsistency) in multiple sets of NER data annotation. In experiments, our method identified label inconsistency in the test data of the SCIERC and CoNLL03 datasets (with 26.7% and 5.4% label mistakes, respectively). It also validated the consistency of the corrected versions of both datasets.
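As a simple illustration of what label (in-)consistency between annotation subsets can look like, the sketch below compares the majority label of each mention string across two subsets. It is a hypothetical surface-level check with made-up example labels, not the paper's empirical method.

```python
# Simple illustration of surface-level label (in-)consistency between two
# annotated NER subsets: for each mention string, compare its majority
# label in each subset. Not the paper's empirical method.
from collections import Counter, defaultdict

def majority_labels(annotations):
    """annotations: iterable of (mention_text, entity_label) pairs."""
    counts = defaultdict(Counter)
    for mention, label in annotations:
        counts[mention.lower()][label] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}

def label_inconsistency(train_anns, test_anns):
    """Fraction of shared mentions whose majority label differs across subsets."""
    train_maj, test_maj = majority_labels(train_anns), majority_labels(test_anns)
    shared = train_maj.keys() & test_maj.keys()
    if not shared:
        return 0.0
    disagree = [m for m in shared if train_maj[m] != test_maj[m]]
    return len(disagree) / len(shared)

train = [("BERT", "Method"), ("SQuAD", "Material"), ("BERT", "Method")]
test = [("BERT", "Generic"), ("SQuAD", "Material")]
print(label_inconsistency(train, test))  # 0.5: "bert" is labeled differently
```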