Contrastive Post-training Large Language Models on Data Curriculum
Canwen Xu, Corby Rosset, Luciano Del Corro, Shweti Mahajan, Julian McAuley, Jennifer Neville, Ahmed Hassan Awadallah, Nikhil Rao
arXiv.org Artificial Intelligence
Alignment serves as an important step to steer large language models (LLMs) towards human preferences. In this paper, we explore contrastive post-training techniques for alignment by automatically constructing preference pairs from multiple models of varying strengths (e.g., InstructGPT, ChatGPT and GPT-4). We carefully compare the contrastive techniques of SLiC and DPO to SFT baselines and find that DPO provides a step-function improvement even after continuing SFT saturates. We also explore a data curriculum learning scheme for contrastive post-training, which starts by learning from "easier" pairs and transitions to "harder" ones, further improving alignment. Finally, we scale up our experiments to train with more data and larger models like Orca. Remarkably, contrastive post-training further improves the performance of Orca, already a state-of-the-art instruction learning model tuned with GPT-4 outputs, to exceed that of ChatGPT.

The rapid evolution of Large Language Models (LLMs) has ushered in a new era of natural language processing capabilities. These models, when scaled to billions of parameters and pretrained over trillions of text tokens, demonstrate unprecedented proficiency in a wide array of tasks (Brown et al., 2020; Chowdhery et al., 2022). Various post-training procedures like supervised instruction tuning and Reinforcement Learning from Human Feedback (RLHF) fine-tune pretrained LLMs to better align with human expectations and preferences (Ouyang et al., 2022; OpenAI, 2023; Touvron et al., 2023a). This additional alignment procedure is crucial, because the pretraining objective of essentially predicting the next token in a text sequence is known to produce LLMs whose outputs are at times incorrect, irrelevant, or unsafe (Bai et al., 2022a). Traditionally, these post-training techniques rely on human preference annotations to inform an LLM which behaviors it ought to adopt in the scenario at hand. For instance, RLHF fits a reward model on these preference pairs, against which an LLM policy is then optimized (Ziegler et al., 2019; Bai et al., 2022a; Touvron et al., 2023b). However, such human feedback is expensive to obtain and often noisy (Stiennon et al., 2020; Ouyang et al., 2022; Bai et al., 2022a). To align an LLM without human feedback, other methods such as Reinforcement Learning from AI Feedback (RLAIF) harvest preference signals via automatic feedback from another LLM (Lee et al., 2023; Bai et al., 2022b). However, studies have found that AI feedback has a low agreement rate with humans (Perez et al., 2022; Casper et al., 2023b; Lee et al., 2021). Also, these methods suffer from the same drawbacks as RLHF, such as reward hacking (Skalse et al., 2022).
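To make the description above more concrete, the following Python sketch illustrates how preference pairs might be constructed from models of varying strengths and how such pairs feed into the standard DPO loss. It assumes per-sequence log-probabilities are already computed; the helper names (build_pairs, dpo_loss), the model ranking, and the gap-based curriculum ordering are assumptions for exposition, not the paper's implementation.

```python
# A minimal sketch (not the authors' released code) of two ideas described above:
# (1) forming preference pairs from models of varying strengths, with the rank
#     gap used as a rough "easy vs. hard" signal for the data curriculum, and
# (2) the standard DPO loss such pairs are trained with.
import torch
import torch.nn.functional as F


def build_pairs(prompts, responses_by_model,
                ranking=("gpt4", "chatgpt", "instructgpt")):
    """For each prompt, pair a higher-ranked model's response ("chosen") with a
    lower-ranked model's response ("rejected"). A larger rank gap is treated as
    an "easier" pair; the curriculum trains on those first."""
    pairs = []
    for i, strong in enumerate(ranking):
        for j in range(i + 1, len(ranking)):
            weak = ranking[j]
            for p in prompts:
                pairs.append({
                    "prompt": p,
                    "chosen": responses_by_model[strong][p],
                    "rejected": responses_by_model[weak][p],
                    "gap": j - i,  # difficulty proxy: big gap = easier pair
                })
    return sorted(pairs, key=lambda pair: -pair["gap"])  # easy-to-hard order


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective: widen the policy's margin for the chosen response over the
    rejected one, measured relative to a frozen reference model."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


# Toy usage with precomputed per-sequence log-probabilities (batch of two).
loss = dpo_loss(torch.tensor([-12.3, -10.1]), torch.tensor([-15.0, -11.8]),
                torch.tensor([-12.9, -10.5]), torch.tensor([-14.2, -11.6]))
print(loss)
```

The design point mirrored here is that the contrastive signal comes for free from the known strength ordering of the source models, so no human or AI judge is needed to label which response in a pair is preferred.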
Oct-3-2023