Neural Information Processing Systems

…one term would be $\frac{1}{L}\sum_{i=1}^{L} f_{RT}(x_i)$, and another term of $\mathbb{E}_{x}\big[\mathcal{L}\big(\frac{1}{L}\sum_{i=1}^{L} f_{RT}(x_i)\big)\big]$ that propagates through the Transformer blocks. Figure 1 shows the full comparison of the baseline and PLD, fine-tuned at different checkpoints. Overall, we observe that PLD not only trains BERT faster in pre-training but also preserves performance on downstream tasks. Results are visualized in Figure 1, which shows that the baseline is less robust to the choice of learning rates.



A of Main Results

Neural Information Processing Systems

B.1 Additional Variants

We also conducted ablations on several variants of GAPX. Specifically, GAPX(neg-log) modifies Eqn. 5.

C.2 Interpreting the Results

Figure 6 shows an example from QQP illustrating how to interpret the result of our method by OODP. For GAP, we can use the score defined in Eqn. 4, split on each word. In all three models, higher scores represent a higher chance of being non-paraphrases. For GAP, the threshold is 0, while for OODP the threshold is 3. Its reliance on the word 'a' might be due to the error. The metrics are calculated as follows:

1. We use the RoBERTa model described in Section 4.2


Preventing Catastrophic Forgetting in Continual Learning of New Natural Language Tasks

Kar, Sudipta, Castellucci, Giuseppe, Filice, Simone, Malmasi, Shervin, Rokhlenko, Oleg

arXiv.org Artificial Intelligence

Multi-Task Learning (MTL) is widely accepted in Natural Language Processing as a standard technique for learning multiple related tasks in one model. Training an MTL model requires having the training data for all tasks available at the same time. As systems usually evolve over time (e.g., to support new functionalities), adding a new task to an existing MTL model usually requires retraining the model from scratch on all the tasks, which can be time-consuming and computationally expensive. Moreover, in some scenarios, the data used to train the original model may no longer be available, for example, due to storage or privacy concerns. In this paper, we approach the problem of incrementally expanding MTL models' capability to solve new tasks over time by distilling the knowledge of an already trained model on n tasks into a new one for solving n+1 tasks. To avoid catastrophic forgetting, we propose to exploit unlabeled data from the same distributions of the old tasks. Our experiments on publicly available benchmarks show that such a technique dramatically benefits the distillation by preserving the already acquired knowledge (i.e., preventing up to 20% performance drops on old tasks) while obtaining good performance on the incrementally added tasks. Further, we also show that our approach is beneficial in practical settings by using data from a leading voice assistant.
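The distillation objective described in the abstract, matching the new model's old-task heads to the teacher's predictions on unlabeled data from the old tasks while supervising the new task normally, can be sketched as follows. The function names, temperature `T`, and mixing weight `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits_old, teacher_logits_old,
                      student_logits_new, labels_new, T=2.0, alpha=0.5):
    """Combine (a) a KL term aligning the student's old-task head with the
    teacher's predictions on *unlabeled* old-task data, and (b) standard
    cross-entropy on the labeled data of the newly added task."""
    p_teacher = softmax(teacher_logits_old, T)
    p_student = softmax(student_logits_old, T)
    kl = np.mean(np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                                     - np.log(p_student + 1e-12)), axis=-1))
    p_new = softmax(student_logits_new)
    ce = -np.mean(np.log(p_new[np.arange(len(labels_new)), labels_new] + 1e-12))
    return alpha * kl + (1 - alpha) * ce
```

When the student's old-task logits match the teacher's, the KL term vanishes and only the new-task loss remains, which is the mechanism that preserves old-task behavior without the original labels.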


Privacy Adhering Machine Un-learning in NLP

Kumar, Vinayshekhar Bannihatti, Gangadharaiah, Rashmi, Roth, Dan

arXiv.org Artificial Intelligence

Regulations introduced by the General Data Protection Regulation (GDPR) in the EU and the California Consumer Privacy Act (CCPA) in the US include provisions on the \textit{right to be forgotten} that mandate industry applications to remove data related to an individual from their systems. In several real-world industry applications that use Machine Learning to build models on user data, such mandates require significant effort both in terms of data cleansing and model retraining, while ensuring the models do not deteriorate in prediction quality due to removal of data. As a result, continuous removal of data and model retraining steps do not scale if these applications receive such requests at a very high frequency. Recently, a few researchers proposed the idea of \textit{Machine Unlearning} to tackle this challenge. Despite the significant importance of this task, the area of Machine Unlearning is under-explored in Natural Language Processing (NLP) tasks. In this paper, we explore the Unlearning framework on various GLUE tasks \cite{Wang:18}, such as QQP, SST, and MNLI. We propose computationally efficient approaches (SISA-FC and SISA-A) to perform \textit{guaranteed} Unlearning that provide significant reductions in memory (90-95\%), time (100x), and space consumption (99\%) in comparison to the baselines while keeping model performance constant.
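The SISA (Sharded, Isolated, Sliced, Aggregated) framework the abstract builds on makes unlearning cheap because each training point lives in exactly one shard: deleting it only requires retraining that shard's constituent model. A minimal sketch, using a toy nearest-class-mean classifier as the per-shard model (the paper's SISA-FC and SISA-A variants use transformer components instead; class and method names here are illustrative):

```python
import numpy as np

class SISAClassifier:
    """Sketch of SISA training: shards are trained in isolation, and
    predictions aggregate the per-shard models by majority vote."""

    def __init__(self, n_shards=3):
        self.n_shards = n_shards
        self.shards = [[] for _ in range(n_shards)]   # (x, y) pairs per shard
        self.models = [None] * n_shards               # per-shard class means

    def _shard_of(self, x):
        # Deterministic assignment of a point to a single shard.
        return hash(tuple(x)) % self.n_shards

    def _train_shard(self, s):
        # Toy constituent model: nearest-class-mean classifier.
        by_class = {}
        for x, y in self.shards[s]:
            by_class.setdefault(y, []).append(x)
        self.models[s] = {y: np.mean(v, axis=0) for y, v in by_class.items()}

    def fit(self, X, y):
        for xi, yi in zip(X, y):
            self.shards[self._shard_of(xi)].append((tuple(xi), yi))
        for s in range(self.n_shards):
            self._train_shard(s)

    def unlearn(self, x, y):
        """Guaranteed removal: drop the point, retrain only its shard."""
        s = self._shard_of(x)
        self.shards[s].remove((tuple(x), y))
        self._train_shard(s)

    def predict(self, x):
        votes = []
        for m in self.models:
            if m:
                votes.append(min(m, key=lambda c: np.linalg.norm(np.array(m[c]) - x)))
        return max(set(votes), key=votes.count)
```

The cost of `unlearn` is one shard's training run rather than a full retrain, which is the source of the time savings the abstract reports.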


GAPX: Generalized Autoregressive Paraphrase-Identification X

Zhou, Yifei, Li, Renyu, Housen, Hayden, Lim, Ser-Nam

arXiv.org Artificial Intelligence

Paraphrase Identification is a fundamental task in Natural Language Processing. While much progress has been made in the field, the performance of many state-of-the-art models often suffers from distribution shift at inference time. We verify that a major source of this performance drop comes from biases introduced by negative examples. To overcome these biases, we propose in this paper to train two separate models, one that only utilizes the positive pairs and the other the negative pairs. This gives us the option of deciding how much to utilize the negative model, for which we introduce a perplexity-based out-of-distribution metric that we show can effectively and automatically determine how much weight it should be given during inference. We support our findings with strong empirical results.
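The inference scheme the abstract describes, down-weighting the negative-pair model when a perplexity-based metric flags the input as out-of-distribution, might be sketched as below. The sigmoid gating and its `ppl_threshold` and `sharpness` parameters are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def gapx_score(pos_model_score, neg_model_score, neg_perplexity,
               ppl_threshold=50.0, sharpness=0.1):
    """Blend the positive-only and negative-only models' scores. The weight
    on the negative model decays as its perplexity on the input rises, i.e.
    as the input looks out-of-distribution relative to the negative examples
    seen in training."""
    w = 1.0 / (1.0 + np.exp(sharpness * (neg_perplexity - ppl_threshold)))
    return pos_model_score + w * neg_model_score
```

Under this gating, an in-distribution input uses both models almost fully, while a far-out-of-distribution input falls back to the positive-pair model alone, which is the behavior that shields inference from the negative-example biases.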


Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

Nangia, Nikita, Bowman, Samuel R.

arXiv.org Artificial Intelligence

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress, however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.