Not enough data to create a plot.
Try a different view from the menu above.
Chen, Weizhu
A Token-level Reference-free Hallucination Detection Benchmark for Free-form Text Generation
Liu, Tianyu, Zhang, Yizhe, Brockett, Chris, Mao, Yi, Sui, Zhifang, Chen, Weizhu, Dolan, Bill
Large pretrained generative models like GPT-3 often suffer from hallucinating non-existent or incorrect content, which undermines their potential merits in real applications. Existing work usually attempts to detect these hallucinations based on a corresponding oracle reference at a sentence or document level. However ground-truth references may not be readily available for many free-form text generation applications, and sentence- or document-level detection may fail to provide the fine-grained signals that would prevent fallacious content in real time. As a first step to addressing these issues, we propose a novel token-level, reference-free hallucination detection task and an associated annotated dataset named HaDes (HAllucination DEtection dataSet). To create this dataset, we first perturb a large number of text segments extracted from English language Wikipedia, and then verify these with crowd-sourced annotations. To mitigate label imbalance during annotation, we utilize an iterative model-in-loop strategy. We conduct comprehensive data analyses and create multiple baseline models.
Reader-Guided Passage Reranking for Open-Domain Question Answering
Mao, Yuning, He, Pengcheng, Liu, Xiaodong, Shen, Yelong, Gao, Jianfeng, Han, Jiawei, Chen, Weizhu
Current open-domain question answering (QA) systems often follow a Retriever-Reader (R2) architecture, where the retriever first retrieves relevant passages and the reader then reads the retrieved passages to form an answer. In this paper, we propose a simple and effective passage reranking method, Reader-guIDEd Reranker (Rider), which does not involve any training and reranks the retrieved passages solely based on the top predictions of the reader before reranking. We show that Rider, despite its simplicity, achieves 10 to 20 absolute gains in top-1 retrieval accuracy and 1 to 4 Exact Match (EM) score gains without refining the retriever or reader. In particular, Rider achieves 48.3 EM on the Natural Questions dataset and 66.4 on the TriviaQA dataset when only 1,024 tokens (7.8 passages on average) are used as the reader input.
UnitedQA: A Hybrid Approach for Open Domain Question Answering
Cheng, Hao, Shen, Yelong, Liu, Xiaodong, He, Pengcheng, Chen, Weizhu, Gao, Jianfeng
To date, most of recent work under the retrieval-reader framework for open-domain QA focuses on either extractive or generative reader exclusively. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvement over previous state-of-the-art models. We demonstrate that a simple hybrid approach by combining answers from both readers can efficiently take advantages of extractive and generative answer inference strategies and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA respectively.
NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned
Min, Sewon, Boyd-Graber, Jordan, Alberti, Chris, Chen, Danqi, Choi, Eunsol, Collins, Michael, Guu, Kelvin, Hajishirzi, Hannaneh, Lee, Kenton, Palomaki, Jennimaria, Raffel, Colin, Roberts, Adam, Kwiatkowski, Tom, Lewis, Patrick, Wu, Yuxiang, Kรผttler, Heinrich, Liu, Linqing, Minervini, Pasquale, Stenetorp, Pontus, Riedel, Sebastian, Yang, Sohee, Seo, Minjoon, Izacard, Gautier, Petroni, Fabio, Hosseini, Lucas, De Cao, Nicola, Grave, Edouard, Yamada, Ikuya, Shimaoka, Sonse, Suzuki, Masatoshi, Miyawaki, Shumpei, Sato, Shun, Takahashi, Ryo, Suzuki, Jun, Fajcik, Martin, Docekal, Martin, Ondrej, Karel, Smrz, Pavel, Cheng, Hao, Shen, Yelong, Liu, Xiaodong, He, Pengcheng, Chen, Weizhu, Gao, Jianfeng, Oguz, Barlas, Chen, Xilun, Karpukhin, Vladimir, Peshterliev, Stan, Okhonko, Dmytro, Schlichtkrull, Michael, Gupta, Sonal, Mehdad, Yashar, Yih, Wen-tau
We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage contestants to explore the trade-off between storing large, redundant, retrieval corpora or the parameters of large learned models. In this report, we describe the motivation and organization of the competition, review the best submissions, and analyze system predictions to inform a discussion of evaluation for open-domain QA.
MixKD: Towards Efficient Distillation of Large-scale Language Models
Liang, Kevin J, Hao, Weituo, Shen, Dinghan, Zhou, Yufan, Chen, Weizhu, Chen, Changyou, Carin, Lawrence
Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove, from a theoretical perspective, that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation
Shen, Dinghan, Zheng, Mingzhi, Shen, Yelong, Qu, Yanru, Chen, Weizhu
Adversarial training has been shown effective at endowing the learned representations with stronger generalization ability. However, it typically requires expensive computation to determine the direction of the injected perturbations. In this paper, we introduce a set of simple yet effective data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead. A Jensen-Shannon Divergence consistency loss is further utilized to incorporate these augmented samples into the training objective in a principled manner. To verify the effectiveness of the proposed strategies, we apply cutoff to both natural language understanding and generation problems. On the GLUE benchmark, it is demonstrated that cutoff, in spite of its simplicity, performs on par or better than several competitive adversarial-based approaches. We further extend cutoff to machine translation and observe significant gains in BLEU scores (based upon the Transformer Base model). Moreover, cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.
Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model
Zheng, Mingzhi, Shen, Dinghan, Shen, Yelong, Chen, Weizhu, Xiao, Lin
Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM would lead to undesirably large gradient variance. Thus, we theoretically quantify the gradient variance via correlating the gradient covariance with the Hamming distance between two different masks (given a certain text sequence). To reduce the variance due to the sampling of masks, we propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments. Thereafter, the tokens within one segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking schema have a smaller variance and can lead to more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that this new masking strategy can consistently outperform standard random masking. Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework.
Understanding the Difficulty of Training Transformers
Liu, Liyuan, Liu, Xiaodong, Gao, Jianfeng, Chen, Weizhu, Han, Jiawei
Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding designing cutting-edge optimizers and learning rate schedulers carefully (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize stabilize the early stage's training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.
On the Variance of the Adaptive Learning Rate and Beyond
Liu, Liyuan, Jiang, Haoming, He, Pengcheng, Chen, Weizhu, Liu, Xiaodong, Gao, Jianfeng, Han, Jiawei
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in details. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are available at: https://github.com/LiyuanLucasLiu/RAdam.
Lessons from Contextual Bandit Learning in a Customer Support Bot
Karampatziakis, Nikos, Kochman, Sebastian, Huang, Jade, Mineiro, Paul, Osborne, Kathy, Chen, Weizhu
In this work, we describe practical lessons we have learned from successfully using contextual bandits (CBs) to improve key business metrics of the Microsoft Virtual Agent for customer support. While our current use cases focus on single step einforcement learning (RL) and mostly in the domain of natural language processing and information retrieval we believe many of our findings are generally applicable. Through this article, we highlight certain issues that RL practitioners may encounter in similar types of applications as well as offer practical solutions to these challenges.