Inductive Learning
WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning
Wu, Wenhao, Li, Wei, Xiao, Xinyan, Liu, Jiachen, Li, Sujian, Lv, Yajuan
A crucial issue of current text generation models is that they often uncontrollably generate text that is factually inconsistent with respect to their inputs. Limited by the lack of annotated data, existing works on evaluating factual consistency directly transfer the reasoning ability of models trained on other data-rich upstream tasks, such as question answering (QA) and natural language inference (NLI), without any further adaptation. As a result, they perform poorly on real generated text and are heavily biased by their single-source upstream tasks. To alleviate this problem, we propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. WeCheck first utilizes a generative model to accurately label a real generated sample by aggregating its weak labels, which are inferred from multiple resources. Then, we train the target metric model with the weak supervision while taking noise into account. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement on average over previous state-of-the-art methods on the TRUE benchmark.
Instance-based Max-margin for Practical Few-shot Recognition
Fu, Minghao, Zhu, Ke, Wu, Jianxin
In order to mimic the human few-shot learning (FSL) ability better and to make FSL closer to real-world applications, this paper proposes a practical FSL (pFSL) setting. pFSL is based on unsupervised pretrained models (analogous to human prior knowledge) and recognizes many novel classes simultaneously. Compared to traditional FSL, pFSL is simpler in its formulation, easier to evaluate, more challenging and more practical. To cope with the rarity of training examples, this paper proposes IbM2, an instance-based max-margin method that is designed for the new pFSL setting and also works well in traditional FSL scenarios. Based on the Gaussian Annulus Theorem, IbM2 converts random noise applied to the instances into a mechanism to achieve maximum margin in the many-way pFSL (or traditional FSL) recognition task. Experiments with various self-supervised pretraining methods and diverse many- or few-way FSL tasks show that IbM2 almost always improves on its respective baseline methods, and in most cases the improvements are significant. With both the new pFSL setting and the novel IbM2 method, this paper shows that practical few-shot learning is both viable and promising.
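The Gaussian Annulus Theorem mentioned in the abstract above states that isotropic Gaussian noise in a high-dimensional space concentrates in a thin shell of radius about sigma times the square root of the dimension. The following is a minimal sketch of that concentration effect only, not of the IbM2 method itself; the function name `noise_norm`, the dimension, and the sample count are illustrative assumptions.

```python
import math
import random


def noise_norm(dim, sigma, rng):
    """Return the Euclidean norm of one isotropic Gaussian noise vector."""
    return math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim)))


# Hypothetical settings chosen for illustration only.
rng = random.Random(0)
dim = 4096
sigma = 1.0

norms = [noise_norm(dim, sigma, rng) for _ in range(20)]
expected_radius = sigma * math.sqrt(dim)

# Every sampled norm should sit close to the expected radius.
max_deviation = max(abs(n - expected_radius) for n in norms)
print(f"expected radius: {expected_radius:.1f}")
print(f"largest deviation over {len(norms)} samples: {max_deviation:.2f}")
```

Because the norms cluster so tightly, noise added to an instance behaves roughly like a point on a sphere of fixed radius around that instance, which is the property the abstract says IbM2 builds on.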
Theoretical and Practical Perspectives on what Influence Functions Do
Schioppa, Andrea, Filippova, Katja, Titov, Ivan, Zablotskaia, Polina
Influence functions (IF) have been seen as a technique for explaining model predictions through the lens of the training data. Their utility is assumed to be in identifying training examples "responsible" for a prediction so that, for example, correcting a prediction is possible by intervening on those examples (removing or editing them) and retraining the model. However, recent empirical studies have shown that the existing methods of estimating IF predict the leave-one-out-and-retrain effect poorly. In order to understand the mismatch between the theoretical promise and the practical results, we analyse five assumptions made by IF methods which are problematic for modern-scale deep neural networks and which concern convexity, numeric stability, training trajectory and parameter divergence. This allows us to clarify what can be expected theoretically from IF. We show that while most assumptions can be addressed successfully, the parameter divergence poses a clear limitation on the predictive power of IF: influence fades over training time even with deterministic training. We illustrate this theoretical result with BERT and ResNet models. Another conclusion from the theoretical analysis is that IF are still useful for model debugging and correcting even though some of the assumptions made in prior work do not hold: using natural language processing and computer vision tasks, we verify that mis-predictions can be successfully corrected by taking only a few fine-tuning steps on influential examples.
DIONYSUS: A Pre-trained Model for Low-Resource Dialogue Summarization
Li, Yu, Peng, Baolin, He, Pengcheng, Galley, Michel, Yu, Zhou, Gao, Jianfeng
Dialogue summarization has recently garnered significant attention due to its wide range of applications. However, existing methods for summarizing dialogues have limitations because they do not take into account the inherent structure of dialogue and rely heavily on labeled data, which can lead to poor performance in new domains. In this work, we propose DIONYSUS (dynamic input optimization in pre-training for dialogue summarization), a pre-trained encoder-decoder model for summarizing dialogues in any new domain. To pretrain DIONYSUS, we create two pseudo summaries for each dialogue example: one from a fine-tuned summarization model and the other from important dialogue turns. We then choose one of these pseudo summaries based on information distribution differences in different types of dialogues. This selected pseudo summary serves as the objective for pre-training DIONYSUS using a self-supervised approach on a large dialogue corpus. Our experiments show that DIONYSUS outperforms existing...
Figure 1: A summary of a dialogue in the SAMSum dataset, where the golden summary effectively compiles relevant information (in yellow) from the entire conversation.
Coping with low data availability for social media crisis message categorisation
During crisis situations, social media allows people to quickly share information, including messages requesting help. This can be valuable to emergency responders, who need to categorise and prioritise these messages based on the type of assistance being requested. However, the high volume of messages makes it difficult to filter and prioritise them without the use of computational techniques. Fully supervised filtering techniques for crisis message categorisation typically require a large amount of annotated training data, but this can be difficult to obtain during an ongoing crisis and is expensive in terms of time and labour to create. This thesis focuses on addressing the challenge of low data availability when categorising crisis messages for emergency response. It first presents domain adaptation as a solution for this problem, which involves learning a categorisation model from annotated data from past crisis events (source domain) and adapting it to categorise messages from an ongoing crisis event (target domain). In many-to-many adaptation, where the model is trained on multiple past events and adapted to multiple ongoing events, a multi-task learning approach is proposed using pre-trained language models. This approach outperforms baselines and an ensemble approach further improves performance...
Healing Unsafe Dialogue Responses with Weak Supervision Signals
Liang, Zi, Wang, Pinghui, Zhang, Ruofei, Zhang, Shuo, Ye, Xiaofan, Huang, Yi, Feng, Junlan
Recent years have seen increasing concerns about unsafe response generation in large-scale dialogue systems, where agents learn offensive or biased behaviors from real-world corpora. Some methods have been proposed to address this issue by detecting and replacing unsafe training examples in a pipeline style. Though effective, they suffer from a high annotation cost and adapt poorly to unseen scenarios as well as adversarial attacks. Moreover, failing to provide safe responses (e.g. simply replacing unsafe ones with templates) causes an information-missing problem in dialogues. To address these issues, we propose an unsupervised pseudo-label sampling method, TEMP, that can automatically assign potential safe responses. Specifically, our TEMP method groups responses into several clusters and samples multiple labels with an adaptively sharpened sampling strategy, inspired by the observation that unsafe samples in the clusters are usually few and are distributed in the tail. Extensive experiments in chitchat and task-oriented dialogues show that our TEMP outperforms state-of-the-art models with weak supervision signals and obtains comparable results under unsupervised learning settings.
XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Ruder, Sebastian, Clark, Jonathan H., Gutkin, Alexander, Kale, Mihir, Ma, Min, Nicosia, Massimo, Rijhwani, Shruti, Riley, Parker, Sarr, Jean-Michel A., Wang, Xinyi, Wieting, John, Gupta, Nitish, Katanova, Anna, Kirov, Christo, Dickinson, Dana L., Roark, Brian, Samanta, Bidisha, Tao, Connie, Adelani, David I., Axelrod, Vera, Caswell, Isaac, Cherry, Colin, Garrette, Dan, Ingle, Reeve, Johnson, Melvin, Panteleev, Dmitry, Talukdar, Partha
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
Understanding Label Bias in Single Positive Multi-Label Learning
Arroyo, Julio, Perona, Pietro, Cole, Elijah
Annotating data for multi-label classification is prohibitively expensive because every category of interest must be confirmed to be present or absent. Recent work on single positive multi-label (SPML) learning shows that it is possible to train effective multi-label classifiers using only one positive label per image. However, the standard benchmarks for SPML are derived from traditional multi-label classification datasets by retaining one positive label for each training example (chosen uniformly at random) and discarding all other labels. In realistic settings it is not likely that positive labels are chosen uniformly at random. This work introduces protocols for studying label bias in SPML and provides new empirical results.
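The standard benchmark construction described in the abstract above (retain one positive label per training example, chosen uniformly at random, and discard the rest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `to_single_positive`, the list-of-lists label format, and the use of `None` for unobserved labels are all hypothetical choices.

```python
import random


def to_single_positive(labels, seed=0):
    """Keep exactly one positive label per example, chosen uniformly at random.

    `labels` is a list of rows, each a list of 0/1 flags (one per category).
    Returns rows holding 1 for the retained positive and None for every
    label that is treated as unobserved.
    """
    rng = random.Random(seed)
    observed = []
    for row in labels:
        positives = [j for j, y in enumerate(row) if y == 1]
        if not positives:
            raise ValueError("each example needs at least one positive label")
        keep = rng.choice(positives)  # uniform over this example's positives
        observed.append([1 if j == keep else None for j in range(len(row))])
    return observed


# Toy fully labeled data: 3 examples, 4 categories.
full = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]]
spml = to_single_positive(full)
for original, reduced in zip(full, spml):
    print(original, "->", reduced)
```

Studying label bias would amount to replacing the uniform `rng.choice` step with a non-uniform selection rule; the specific protocols the paper introduces are not given in the abstract, so none are sketched here.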
Context-Aware Transformer Pre-Training for Answer Sentence Selection
Di Liello, Luca, Garg, Siddhant, Moschitti, Alessandro
Answer Sentence Selection (AS2) is a core component for building an accurate Question Answering pipeline. AS2 models rank a set of candidate sentences based on how likely they answer a given question. The state of the art in AS2 exploits pre-trained transformers by transferring them on large annotated datasets, while using local contextual information around the candidate sentence. In this paper, we propose three pre-training objectives designed to mimic the downstream fine-tuning task of contextual AS2. This allows for specializing LMs when fine-tuning for contextual AS2. Our experiments on three public and two large-scale industrial datasets show that our pre-training approaches (applied to RoBERTa and ELECTRA) can improve baseline contextual AS2 accuracy by up to 8% on some datasets.
TaxoKnow: Taxonomy as Prior Knowledge in the Loss Function of Multi-class Classification
Pourvali, Mohsen, Meng, Yao, Sheng, Chen, Du, Yangzhou
Chiriatti 2020), have made significant advances in Natural Language Processing (NLP). In general, pre-training, where a model first trains on massive amounts of data before being fine-tuned for a specific task, has proven to be an efficient technique for improving the performance of a wide range of language tasks (Min et al. 2021). For example, BERT (Devlin et al. 2018) is a pre-trained transformer-based encoder model that can be fine-tuned for various NLP tasks, such as sentence classification, question answering, and named entity recognition. In fact, large language models have shown a so-called few-shot learning capability to be efficiently adapted to downstream tasks.
assumption in the real world. Moreover, compared to human capabilities, DNNs still lack in various aspects, such as Adaptability, Generalizability, Robustness, Explainability, Abstraction, Common sense, and Causal reasoning. In general, Multi-Layer Perceptrons (MLPs) are good at generalizing within the space of training examples, but they perform poorly at generalizing outside the space of training examples, and this limitation is not improved even by adding