Training Socially Aligned Language Models on Simulated Social Interactions
Liu, Ruibo, Yang, Ruixin, Jia, Chenyan, Zhang, Ge, Zhou, Denny, Dai, Andrew M., Yang, Diyi, Vosoughi, Soroush
arXiv.org Artificial Intelligence
Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attacks. This work presents a novel training paradigm that permits LMs to learn from simulated social interactions. In comparison to existing methodologies, our approach is considerably more scalable and efficient, demonstrating superior performance in alignment benchmarks and human evaluations. This paradigm shift in the training of LMs brings us a step closer to developing AI systems that can robustly and accurately reflect societal norms and values.

"We want AI agents that can discover like we can, not which contain what we have discovered." (Richard Sutton, The Bitter Lesson, 2019)

By virtue of their ability to "predict the next token(s)", contemporary pre-trained language models (LMs) have shown remarkable proficiency in memorizing extensive corpora, thereby enabling the generation of text indistinguishable from human-produced content (Brown et al., 2020). However, successful memorization of human knowledge does not assure a model's propensity to perform as per societal expectations. Recent research has exposed behavioral anomalies in these LMs (Weidinger et al., 2022), which include the generation of harmful content (Gehman et al., 2020; Bommasani et al., 2021), the reinforcement of bias (Venkit et al., 2022; Liu et al., 2022), and the dissemination of disinformation (Tamkin et al., 2021; Lin et al., 2022). The process of enhancing desirable societal behaviors and inhibiting undesirable ones is commonly referred to as "social alignment" (Gabriel, 2020; Taylor et al., 2016).

Supervised Fine-Tuning (SFT) presents a straightforward method for achieving alignment by training LMs on socially aligned data (Figure 1 [a]). However, this method often yields models susceptible to adversarial attacks, such as "jailbreaking" prompts (Subhash, 2023; Xu et al., 2021), due to limited exposure to misaligned data during training (Amodei et al., 2016). To address this, a more advanced technique, "reward modeling", has been proposed (Leike et al., 2018; Christiano et al., 2017). This involves training a reward model as a surrogate for human judgment to guide the optimization of the LM (e.g., OpenAI's RLHF, Figure 1 [b]).
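As a rough illustration of these two strategies, the sketch below uses toy PyTorch modules as stand-ins for the LM and the reward model; the module names, dimensions, and random data are assumptions for exposition, not the paper's implementation. SFT minimizes cross-entropy on aligned demonstrations, while reward modeling fits a scalar score on human preference pairs that later serves as the surrogate judge.

```python
# Hypothetical sketch, not the paper's code: toy PyTorch stand-ins illustrating
# (a) supervised fine-tuning on aligned data and (b) reward modeling on preferences.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 100, 32

# (a) SFT: maximize likelihood of socially aligned demonstrations (cross-entropy).
lm_head = nn.Linear(hidden, vocab_size)              # stand-in for an LM's output layer
states = torch.randn(8, hidden)                      # hidden states for 8 token positions
aligned_tokens = torch.randint(0, vocab_size, (8,))  # next tokens from aligned data
sft_loss = F.cross_entropy(lm_head(states), aligned_tokens)

# (b) Reward modeling: fit a scalar reward so preferred responses score higher than
# rejected ones (Bradley-Terry-style loss); the reward then guides RL fine-tuning.
reward_model = nn.Linear(hidden, 1)                  # stand-in for a reward head
chosen = torch.randn(4, hidden)                      # features of human-preferred responses
rejected = torch.randn(4, hidden)                    # features of dispreferred responses
rm_loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()

print(f"SFT loss: {sft_loss.item():.3f}, reward-model loss: {rm_loss.item():.3f}")
```

In practice, the reward head would sit on top of the LM's own representations rather than random features, and the learned reward would drive a policy-gradient update of the LM, as in RLHF.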
Oct-28-2023