Fu, Jiayi
Reward Shaping to Mitigate Reward Hacking in RLHF
Fu, Jiayi, Zhao, Xuandong, Yao, Chengyuan, Wang, Heng, Han, Qi, Xiao, Yanghua
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. While reward shaping helps stabilize RLHF and partially mitigates reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of prevalent reward shaping methods. Our analysis suggests three key design principles: (1) the RL reward should ideally be bounded, (2) RL benefits from rapid initial growth followed by gradual convergence, and (3) the RL reward is best formulated as a function of the centered reward. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning. We evaluate PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and remains robust against reward hacking even after two full epochs of training. Code is available at https://github.com/PorUna-byte/PAR.
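To make the three design principles concrete, the following is a minimal Python sketch of a shaping function in the spirit the abstract describes: the raw reward-model score is centered against a single reference reward and squashed through a sigmoid, giving a bounded signal that rises quickly at first and saturates as the policy improves. The function name and the specific choice of sigmoid are illustrative assumptions, not the paper's definitive formulation.

```python
import math

def preference_as_reward(raw_reward: float, reference_reward: float) -> float:
    """Illustrative reward shaping: sigmoid of the centered reward.

    The raw reward-model score is centered by a single reference reward
    (e.g., the score of a reference response to the same prompt) and then
    mapped through a sigmoid, so the RL signal is bounded in (0, 1), grows
    rapidly near the reference point, and saturates afterwards, which
    removes the incentive to keep inflating the raw reward (reward hacking).
    """
    centered = raw_reward - reference_reward
    return 1.0 / (1.0 + math.exp(-centered))

# A response scored slightly above the reference already earns ~0.73,
# while one with a far larger raw score earns only marginally more,
# so over-optimizing the raw reward yields diminishing returns.
print(preference_as_reward(2.0, 1.0))   # ~0.73
print(preference_as_reward(10.0, 1.0))  # ~0.9999
```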
SocialGaze: Improving the Integration of Human Social Norms in Large Language Models
Vijjini, Anvesh Rao, Menon, Rakesh R., Fu, Jiayi, Srivastava, Shashank, Chaturvedi, Snigdha
While much research over the last few years has explored enhancing the reasoning capabilities of large language models (LLMs), there is a gap in understanding how well these models align with social values and norms. We introduce the task of judging social acceptance, which requires models to judge and rationalize the acceptability of people's actions in social situations. For example, is it socially acceptable for a neighbor to ask others in the community to keep their pets indoors at night? We find that LLMs' understanding of social acceptance is often misaligned with human consensus. To alleviate this, we introduce SocialGaze, a multi-step prompting framework in which a language model verbalizes a social situation from multiple perspectives before forming a judgment. Our experiments demonstrate that SocialGaze improves alignment with human judgments by up to 11 F1 points with the GPT-3.5 model. We also identify biases in how LLMs assign blame, related to features such as gender (males are significantly more likely to be judged unfairly) and age (LLMs are more aligned with humans for older narrators).
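The multi-step prompting idea can be sketched as a simple pipeline: the model first verbalizes the situation from the narrator's perspective, then from the other parties' perspectives, and only then is asked for a judgment. The prompt wording and the `complete` helper below are hypothetical placeholders used only to show the overall flow, not the paper's exact prompts.

```python
def social_gaze_judgment(situation: str, complete) -> str:
    """Illustrative multi-step prompting flow: verbalize the situation from
    several perspectives before asking for an acceptability judgment.

    `complete` is any callable that sends a prompt to an LLM and returns its
    text completion (a hypothetical placeholder, not a specific API).
    """
    narrator_view = complete(
        f"Situation: {situation}\n"
        "Describe this situation from the narrator's perspective."
    )
    others_view = complete(
        f"Situation: {situation}\n"
        "Describe this situation from the perspective of the other people involved."
    )
    return complete(
        f"Situation: {situation}\n"
        f"Narrator's perspective: {narrator_view}\n"
        f"Others' perspective: {others_view}\n"
        "Given both perspectives, is the narrator's action socially acceptable? "
        "Answer 'acceptable' or 'unacceptable' and briefly justify."
    )
```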
GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick
Fu, Jiayi, Zhao, Xuandong, Yang, Ruihan, Zhang, Yuansen, Chen, Jiangjie, Xiao, Yanghua
Large language models (LLMs) excel at generating human-like text, but they also raise concerns about misuse in fake news and academic dishonesty. Decoding-based watermarking, particularly the GumbelMax-trick-based watermark (GM watermark), is a standout solution for safeguarding machine-generated texts due to its notable detectability. However, the GM watermark faces a major challenge: it always yields identical outputs for the same prompt, which hurts generation diversity and user experience. To overcome this limitation, we propose a new type of GM watermark, the Logits-Addition watermark, together with three variants specifically designed to enhance diversity. Among these, the GumbelSoft watermark (a softmax variant of the Logits-Addition watermark) demonstrates superior performance in high-diversity settings, with its AUROC score outperforming those of the two alternative variants by 0.1 to 0.3 and surpassing other decoding-based watermarking methods by at least 0.1.
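As a rough illustration of the GumbelMax-trick family the abstract builds on, the sketch below derives per-token Gumbel noise from a secret key and the recent context, adds it to the logits, and either takes the argmax (deterministic for a fixed prompt) or samples from a softmax over the perturbed logits, which restores diversity in the spirit of the GumbelSoft variant. The function and variable names are illustrative assumptions, not the paper's exact construction.

```python
import hashlib
import numpy as np

def keyed_gumbel_noise(key: int, context: tuple, vocab_size: int) -> np.ndarray:
    """Pseudorandom Gumbel noise seeded by a secret key and the recent context,
    so a detector holding the key can regenerate the same noise."""
    seed_bytes = hashlib.sha256(f"{key}-{context}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(seed_bytes[:8], "little"))
    u = rng.uniform(1e-10, 1.0, size=vocab_size)
    return -np.log(-np.log(u))

def next_token(logits: np.ndarray, key: int, context: tuple,
               softmax_temp=None) -> int:
    """Logits-Addition-style step: perturb the logits with keyed Gumbel noise.

    With softmax_temp=None the argmax is taken, which is deterministic per
    prompt; with a temperature, we sample from a softmax over the perturbed
    logits, trading a little detectability for diversity (GumbelSoft-like).
    """
    perturbed = logits + keyed_gumbel_noise(key, context, logits.shape[0])
    if softmax_temp is None:
        return int(np.argmax(perturbed))
    scaled = (perturbed - perturbed.max()) / softmax_temp  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(np.random.default_rng().choice(len(probs), p=probs))
```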
Ask One More Time: Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios
Lin, Lei, Fu, Jiayi, Liu, Pengli, Li, Qingyang, Gong, Yan, Wan, Junchen, Zhang, Fuzheng, Wang, Zhongyuan, Zhang, Di, Gai, Kun
Although chain-of-thought (CoT) prompting combined with language models has achieved encouraging results on complex reasoning tasks, the naive greedy decoding used in CoT prompting often leads to repetitiveness and local optimality. To address this shortcoming, ensemble-optimization methods obtain multiple reasoning paths and assemble them into a final answer. However, current ensemble-optimization methods either simply employ rule-based post-processing such as self-consistency, or train an additional model on several task-related human annotations to select the best among multiple reasoning paths; they fail to generalize to realistic settings where the type of input questions or the answer format of the reasoning paths is unknown. To avoid these limitations, we propose Self-Agreement, a generalizable ensemble-optimization method applicable to almost all scenarios, whether the type of input questions and the answer format of the reasoning paths are known or unknown. Self-Agreement first samples from the language model's decoder to generate a diverse set of reasoning paths, and then prompts the language model one more time to determine the optimal answer by selecting the most agreed-upon answer among the sampled reasoning paths. Self-Agreement achieves both remarkable performance on six public reasoning benchmarks and superior generalization capability.
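The two-stage flow described in the abstract, sampling diverse reasoning paths and then asking the model one more time to pick the most agreed-upon answer, can be sketched roughly as follows. The prompt text and the `generate` helper are assumptions for illustration, not the paper's exact prompts.

```python
def self_agreement(question: str, generate, n_paths: int = 5) -> str:
    """Illustrative two-stage Self-Agreement flow.

    Stage 1: sample a diverse set of reasoning paths with a non-zero
    temperature. Stage 2: prompt the model one more time to select the
    answer most agreed upon across those paths. `generate(prompt,
    temperature)` is a hypothetical LLM call returning a text completion.
    """
    # Stage 1: diverse reasoning paths via temperature sampling.
    paths = [
        generate(f"Q: {question}\nLet's think step by step.", temperature=0.7)
        for _ in range(n_paths)
    ]

    # Stage 2: ask the model itself to pick the most agreed answer. No
    # rule-based answer extraction is needed, so the answer format of the
    # reasoning paths does not have to be known in advance.
    numbered = "\n\n".join(f"Path {i + 1}:\n{p}" for i, p in enumerate(paths))
    selection_prompt = (
        f"Question: {question}\n\n"
        f"Here are several candidate reasoning paths:\n{numbered}\n\n"
        "Which final answer do most of these paths agree on? "
        "Reply with that answer only."
    )
    return generate(selection_prompt, temperature=0.0)
```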
KwaiYiiMath: Technical Report
Fu, Jiayi, Lin, Lei, Gao, Xiaoyang, Liu, Pengli, Chen, Zhengzong, Yang, Zhirui, Zhang, Shengnan, Zheng, Xue, Li, Yan, Liu, Yuliang, Ye, Xucheng, Liao, Yiqiao, Liao, Chao, Chen, Bin, Song, Chengru, Wan, Junchen, Lin, Zijia, Zhang, Fuzheng, Wang, Zhongyuan, Zhang, Di, Gai, Kun
Recent advances in large language models (LLMs) have revolutionized the natural language processing (NLP) landscape (Kenton & Toutanova, 2019; Brown et al., 2020), with scaling up model size and data as a key ingredient (Rae et al., 2021; Chowdhery et al., 2022; Anil et al., 2023; Touvron et al., 2023a;b), and recent progress suggests that LLMs can also tackle reasoning problems (Clark et al., 2020; Talmor et al., 2020; Suzgun et al., 2022; Wei et al., 2022b), including mathematical tasks requiring multi-step reasoning. In this report, we focus on enhancing the mathematical reasoning capabilities of LLMs through an alignment process comprising supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Specifically, we introduce KwaiYiiMath, which is fine-tuned with human alignment techniques from KwaiYiiBase, a large language model developed by Kuaishou (https://github.com/kwai/KwaiYii/), to tackle mathematical problems, and we describe our efforts to collect large amounts of high-quality mathematical training data. We also construct a small-scale Chinese primary school mathematics test set, KMath, consisting of 188 examples, to evaluate the correctness of the problem-solving process generated by the models. Experimental results show that KwaiYiiMath outperforms many open-source models of similar size by a large margin, achieves state-of-the-art (SOTA) performance among similarly sized models, and approaches GPT-4 on three mathematical benchmarks spanning English and Chinese: GSM8k (Cobbe et al., 2021), CMath (Wei et al., 2023), and KMath.