Jin, Lifeng
The Trickle-down Impact of Reward (In-)consistency on RLHF
Shen, Lingfeng, Chen, Sihao, Song, Linfeng, Jin, Lifeng, Peng, Baolin, Mi, Haitao, Khashabi, Daniel, Yu, Dong
Standard practice within Reinforcement Learning from Human Feedback (RLHF) involves optimizing against a Reward Model (RM), which itself is trained to reflect human preferences for desirable generations. A notable yet understudied subject is the (in-)consistency of RMs -- whether they can recognize semantic changes to different prompts and appropriately adapt their reward assignments -- and their impact on the downstream RLHF model. In this paper, we visit a series of research questions relevant to RM inconsistency: (1) How can we measure the consistency of reward models? (2) How consistent are the existing RMs, and how can we improve them? (3) In what ways does reward inconsistency influence the chatbots resulting from RLHF model training? We propose Contrast Instructions -- a benchmarking strategy for the consistency of RMs. Each example in Contrast Instructions features a pair of lexically similar instructions with different ground-truth responses. A consistent RM is expected to rank the corresponding instruction and response higher than other combinations. We observe that current RMs trained with the standard ranking objective fail miserably on Contrast Instructions compared to average humans. To show that RM consistency can be improved efficiently without extra training budget, we propose two techniques, ConvexDA and RewardFusion, which enhance reward consistency through extrapolation during the RM training and inference stages, respectively. We show that RLHF models trained with a more consistent RM yield more useful responses, suggesting that reward inconsistency exhibits a trickle-down effect on the downstream RLHF process.
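As an illustration of the Contrast Instructions idea, the sketch below scores a reward model on contrast pairs: for each pair of lexically similar instructions with their own ground-truth responses, a consistent RM should score the matched instruction-response combination above the mismatched one in both directions. The `reward_model` callable and the toy overlap-based scorer are placeholders for illustration, not the paper's models.

```python
import re
from typing import Callable, List, Tuple

# A Contrast Instructions example: two lexically similar instructions,
# each paired with its own ground-truth response.
ContrastPair = Tuple[str, str, str, str]  # (instr_a, resp_a, instr_b, resp_b)

def consistency_accuracy(
    reward_model: Callable[[str, str], float],
    pairs: List[ContrastPair],
) -> float:
    """Fraction of contrast pairs where the RM scores each instruction's
    own response above the response of its lexically similar counterpart."""
    hits = 0
    for instr_a, resp_a, instr_b, resp_b in pairs:
        # The matched (instruction, response) combination should out-score
        # the mismatched one, in both directions.
        a_ok = reward_model(instr_a, resp_a) > reward_model(instr_a, resp_b)
        b_ok = reward_model(instr_b, resp_b) > reward_model(instr_b, resp_a)
        hits += int(a_ok and b_ok)
    return hits / len(pairs)

if __name__ == "__main__":
    # Toy reward model: word overlap between instruction and response.
    def toy_rm(instruction: str, response: str) -> float:
        tokens = lambda s: set(re.findall(r"[a-z]+", s.lower()))
        return float(len(tokens(instruction) & tokens(response)))

    pairs = [(
        "List three fruits that are red.",
        "Apples, cherries, and strawberries are red fruits.",
        "List three fruits that are yellow.",
        "Bananas, lemons, and pineapples are yellow fruits.",
    )]
    print(f"consistency: {consistency_accuracy(toy_rm, pairs):.2f}")
```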
Stabilizing RLHF through Advantage Model and Selective Rehearsal
Peng, Baolin, Song, Linfeng, Tian, Ye, Jin, Lifeng, Mi, Haitao, Yu, Dong
Large Language Models (LLMs) have revolutionized natural language processing, yet aligning these models with human values and preferences using RLHF remains a significant challenge. This challenge is characterized by various instabilities, such as reward hacking and catastrophic forgetting. In this technical report, we propose two innovations to stabilize RLHF training: (i) Advantage Model, which directly models the advantage score, i.e., the extra reward compared to the expected reward, and regulates score distributions across tasks to prevent reward hacking; and (ii) Selective Rehearsal, which mitigates catastrophic forgetting by strategically selecting data for PPO training and knowledge rehearsing. Our experimental analysis on public and proprietary datasets reveals that the proposed methods not only increase stability in RLHF training but also achieve higher reward scores and win rates.
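A minimal sketch of the advantage idea described above, assuming a generic `reward_model(prompt, response)` scorer: the advantage of a response is its reward minus the expected reward over reference responses for the same prompt, and scores are then standardized per task so one task's reward scale cannot dominate. Function and variable names are illustrative, not the paper's implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple

def advantage_scores(
    reward_model: Callable[[str, str], float],
    samples: List[Tuple[str, str, str]],            # (task, prompt, response)
    reference_responses: Dict[str, List[str]],      # prompt -> sampled reference responses
) -> List[float]:
    """Advantage = reward minus the expected reward for the same prompt,
    then standardized per task so score distributions stay comparable."""
    raw: List[Tuple[str, float]] = []
    for task, prompt, response in samples:
        expected = mean(reward_model(prompt, r) for r in reference_responses[prompt])
        raw.append((task, reward_model(prompt, response) - expected))

    # Per-task standardization keeps one task's reward scale from dominating.
    by_task: Dict[str, List[float]] = defaultdict(list)
    for task, adv in raw:
        by_task[task].append(adv)
    stats = {t: (mean(v), pstdev(v) or 1.0) for t, v in by_task.items()}
    return [(adv - stats[t][0]) / stats[t][1] for t, adv in raw]
```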
Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing
Hua, Wenyue, Jin, Lifeng, Song, Linfeng, Mi, Haitao, Zhang, Yongfeng, Yu, Dong
Pretrained natural language processing (NLP) models have achieved high overall performance, but they still make systematic errors. Instead of manual error analysis, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has attracted increasing attention in computer vision, both for understanding model behavior and for informing future model training and design. However, little research on SDMs, or quantitative evaluation of their effectiveness, has been conducted on NLP tasks. Our paper fills this gap by proposing a benchmark named "Discover, Explain, Improve (DEIM)" for NLP classification tasks, along with a new SDM, Edisa. Edisa discovers coherent and underperforming groups of datapoints; DEIM then unites them under human-understandable concepts and provides comprehensive evaluation tasks and corresponding quantitative metrics. The evaluation in DEIM shows that Edisa can accurately select error-prone datapoints with informative semantic features that summarize error patterns. Detecting difficult datapoints directly boosts model performance without tuning any original model parameters, showing that the discovered slices are actionable for users.
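The sketch below illustrates the basic slice-detection recipe that DEIM evaluates: cluster datapoint representations, then flag clusters whose error rate exceeds the overall error rate. It assumes scikit-learn for clustering; Edisa's actual model is more sophisticated than this baseline, and all names here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_underperforming_slices(
    embeddings: np.ndarray,   # (n, d) datapoint representations
    correct: np.ndarray,      # (n,) 1 if the model got the datapoint right, else 0
    n_slices: int = 8,
    min_size: int = 20,
):
    """Cluster datapoints and flag clusters whose error rate is above the
    overall error rate: candidate 'slices' to inspect, explain, or up-weight."""
    labels = KMeans(n_clusters=n_slices, n_init=10, random_state=0).fit_predict(embeddings)
    overall_err = 1.0 - correct.mean()
    slices = []
    for k in range(n_slices):
        idx = np.where(labels == k)[0]
        if len(idx) < min_size:
            continue  # ignore tiny clusters with unreliable error estimates
        err = 1.0 - correct[idx].mean()
        if err > overall_err:
            slices.append({"slice": k, "size": int(len(idx)), "error_rate": float(err)})
    return sorted(slices, key=lambda s: s["error_rate"], reverse=True)
```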
Friend-training: Learning from Models of Different but Related Tasks
Zhang, Mian, Jin, Lifeng, Song, Linfeng, Mi, Haitao, Zhou, Xiabing, Yu, Dong
Current self-training methods such as standard self-training, co-training, tri-training, and others often focus on improving model performance on a single task, utilizing differences in input features, model architectures, and training processes. However, many tasks in natural language processing are about different but related aspects of language, and models trained for one task can be great teachers for other related tasks. In this work, we propose friend-training, a cross-task self-training framework, where models trained to do different tasks are used in an iterative training, pseudo-labeling, and retraining process to help each other select better pseudo-labels. With two dialogue understanding tasks, conversational semantic role labeling and dialogue rewriting, chosen for a case study, we show that models trained with the friend-training framework achieve the best performance compared with strong baselines.
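A schematic of one friend-training round under simplifying assumptions: each model pseudo-labels the unlabeled pool for its own task, and a pseudo-label is kept only when the friend model's prediction on the related task is consistent with it. The `agree` and `retrain_*` callables stand in for the paper's cross-task selection and training procedures; they are placeholders, not the published method.

```python
from typing import Any, Callable, List, Tuple

def friend_training_round(
    model_a, model_b,                          # models for related tasks A and B
    unlabeled: List[Any],                      # shared unlabeled dialogue data
    agree: Callable[[Any, Any, Any], bool],    # cross-task agreement check
    retrain_a: Callable[[List[Tuple[Any, Any]]], Any],
    retrain_b: Callable[[List[Tuple[Any, Any]]], Any],
) -> Tuple[Any, Any]:
    """One round of friend-training: pseudo-label, select by cross-task
    agreement, then retrain each model on the selected pseudo-labels."""
    selected_a, selected_b = [], []
    for x in unlabeled:
        pred_a = model_a.predict(x)   # e.g., conversational semantic roles
        pred_b = model_b.predict(x)   # e.g., a dialogue rewrite
        if agree(x, pred_a, pred_b):  # keep only mutually consistent pseudo-labels
            selected_a.append((x, pred_a))
            selected_b.append((x, pred_b))
    # Retrain each model on its own labeled data plus the selected pseudo-labels.
    return retrain_a(selected_a), retrain_b(selected_b)
```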
Hierarchical Context Tagging for Utterance Rewriting
Jin, Lisa, Song, Linfeng, Jin, Lifeng, Yu, Dong, Gildea, Daniel
Utterance rewriting aims to recover coreferences and omitted information from the latest turn of a multi-turn dialogue. Recently, methods that tag rather than linearly generate sequences have proven stronger in both in- and out-of-domain rewriting settings. This is due to a tagger's smaller search space as it can only copy tokens from the dialogue context. However, these methods may suffer from low coverage when phrases that must be added to a source utterance cannot be covered by a single context span. This can occur in languages like English that introduce tokens such as prepositions into the rewrite for grammaticality. We propose a hierarchical context tagger (HCT) that mitigates this issue by predicting slotted rules (e.g., "besides_") whose slots are later filled with context spans. HCT (i) tags the source string with token-level edit actions and slotted rules and (ii) fills in the resulting rule slots with spans from the dialogue context. This rule tagging allows HCT to add out-of-context tokens and multiple spans at once; we further cluster the rules to truncate the long tail of the rule distribution. Experiments on several benchmarks show that HCT can outperform state-of-the-art rewriting systems by ~2 BLEU points.
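To make the slotted-rule mechanism concrete, the snippet below fills a rule such as "besides _" with a span copied from the dialogue context, which is roughly what a second, span-filling stage would do after rule tagging; the rule string and span indices are illustrative only, not HCT's actual inventory.

```python
from typing import List, Tuple

def apply_slotted_rule(rule: str, context_tokens: List[str],
                       spans: List[Tuple[int, int]]) -> str:
    """Fill each '_' slot in the rule with a span copied from the dialogue
    context, yielding the phrase to insert into the rewritten utterance."""
    parts = rule.split("_")
    assert len(parts) == len(spans) + 1, "one context span per slot"
    out = []
    for part, (start, end) in zip(parts, spans):
        out.append(part)
        out.append(" ".join(context_tokens[start:end]))
    out.append(parts[-1])
    return "".join(out).strip()

if __name__ == "__main__":
    # Context turn: "Do you like jazz ?"   Source turn: "What about rock ?"
    context = "Do you like jazz ?".split()
    # The rule contributes the out-of-context token "besides"; the slot copies "jazz".
    print(apply_slotted_rule("besides _", context, [(3, 4)]))  # -> "besides jazz"
```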
Domain-Adaptive Pretraining Methods for Dialogue Understanding
Wu, Han, Xu, Kun, Song, Linfeng, Jin, Lifeng, Zhang, Haisong, Song, Linqi
Language models like BERT and SpanBERT pretrained on open-domain data have obtained impressive gains on various NLP tasks. In this paper, we probe the effectiveness of domain-adaptive pretraining objectives on downstream tasks. In particular, three objectives, including a novel objective focusing on modeling predicate-argument relations, are evaluated on two challenging dialogue understanding tasks. Experimental results demonstrate that domain-adaptive pretraining with proper objectives can significantly improve the performance of a strong baseline on these tasks, achieving new state-of-the-art performance.
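As a rough illustration of domain-adaptive pretraining in general (not the paper's predicate-argument objective), the sketch below continues masked language modeling on in-domain dialogue transcripts before any task fine-tuning. It assumes the Hugging Face transformers and datasets libraries and a generic BERT checkpoint; the data and hyperparameters are placeholders.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Toy in-domain corpus: dialogue transcripts for continued pretraining.
dialogues = [
    "A: Who booked the flight? B: Sarah booked it yesterday.",
    "A: Can you reschedule the meeting? B: Sure, how about Friday?",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Tokenize the dialogue text and drop the raw strings.
dataset = Dataset.from_dict({"text": dialogues}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Continue pretraining with a standard masked-LM objective on dialogue data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-dialogue", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```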