Min, Rui
Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Min, Rui, Pang, Tianyu, Du, Chao, Liu, Qian, Cheng, Minhao, Lin, Min
Chatbot Arena is a popular platform for evaluating LLMs through pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or degrade) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier and exclusively voting for $m_{t}$ to win. However, this strategy is practically inefficient: there are over $190$ models on Chatbot Arena, and on average only about $1\%$ of new battles involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, which exploit the Elo rating mechanism of Chatbot Arena, under which any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in that battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at https://github.com/sail-sg/Rigging-ChatbotArena.
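To make the omnipresent idea concrete, the following minimal Python sketch (not the authors' code; the model names, ratings, and K-factor are made up) shows how a single rigged vote on a battle between two other models can shift the target model's rank under a simple online Elo update:

```python
# Illustrative sketch only: how a vote on a battle that does not involve the
# target model can still change the target's rank under online Elo updates.
# Ratings and the K-factor below are arbitrary placeholders.

def elo_update(r_a, r_b, winner, k=4.0):
    """Return updated (r_a, r_b) after one battle; winner is 'a' or 'b'."""
    e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    s_a = 1.0 if winner == "a" else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

ratings = {"target": 1200.0, "rival": 1201.0, "weak": 1000.0}

def rank(name):
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return ordered.index(name) + 1

print("before:", rank("target"))   # target sits just below 'rival' (rank 2)

# Rig one battle between 'rival' and 'weak': the target is not involved,
# yet voting for 'weak' drags 'rival' below the target.
ratings["rival"], ratings["weak"] = elo_update(
    ratings["rival"], ratings["weak"], winner="b"
)
print("after: ", rank("target"))   # target's rank improves to 1
```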
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
Yu, Jia, Yuan, Fei, Min, Rui, Yu, Jing, Chu, Pei, Li, Jiayang, Li, Wei, Zhang, Ruijie, Li, Zhenxiang, Ren, Zhifei, Zheng, Dong, Zhang, Wenjian, Teng, Yan, Meng, Lingyu, Jin, ZhenJiang, Qiu, Jiantao, Wang, ShaSha, Tu, Zhongying, Lin, Dahua, Wang, Yu, Qiao, Yu, Wang, Yanfeng, He, Conghui
This paper introduces WanJuanSiLu, an open-source dataset designed to provide high-quality training corpora for low-resource languages and thereby advance the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored to low-resource languages, encompassing key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through this framework, we have significantly improved both the quality and security of the dataset while maintaining its linguistic diversity. To date, the data for all five languages have been fully open-sourced.
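As a rough illustration of the staged framework described above (not the released pipeline; all function names, thresholds, and the duplicate-detection heuristic below are placeholders), a minimal Python sketch might chain the stages as simple filters:

```python
# Hypothetical sketch of a staged corpus-processing pipeline: cleaning,
# deduplication, safety/security filtering, quality scoring, theme tagging.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    lang: str
    meta: dict = field(default_factory=dict)

def clean(doc):                      # placeholder: normalize whitespace
    doc.text = " ".join(doc.text.split())
    return doc

def is_duplicate(doc, seen_hashes):  # placeholder: exact-match dedup
    h = hash(doc.text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False

def passes_filter(doc):              # placeholder for a keyword/model-based filter
    return "UNSAFE_MARKER" not in doc.text

def quality_score(doc):              # placeholder heuristic; real systems use models
    return min(len(doc.text) / 1000.0, 1.0)

def classify_theme(doc):             # placeholder theme classifier
    return doc.meta.get("theme", "general")

def run_pipeline(raw_docs, quality_threshold=0.3):
    seen, corpus = set(), []
    for doc in map(clean, raw_docs):
        if is_duplicate(doc, seen) or not passes_filter(doc):
            continue
        if quality_score(doc) < quality_threshold:
            continue
        doc.meta["theme"] = classify_theme(doc)
        corpus.append(doc)
    return corpus
```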
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
Best-of-N (BoN) sampling, a common strategy for test-time scaling of Large Language Models (LLMs), relies on reward models to select the best candidate solution from multiple generations. However, traditional reward models often assign arbitrary and inconsistent scores, limiting their effectiveness. To address this, we propose a Pairwise Reward Model (Pairwise RM) combined with a knockout tournament for BoN sampling. Instead of assigning absolute scores, given a math problem, Pairwise RM simultaneously evaluates the correctness of two candidate solutions. This approach eliminates the need for arbitrary scoring and enables cross-validation of solutions through parallel comparison. In the knockout tournament, Pairwise RM conducts pairwise comparisons between candidate solutions and iteratively eliminates the incorrect ones. We construct a large-scale dataset of 443K pairwise comparisons derived from NuminaMath and annotated using \texttt{gemini-1.5-flash}, and train the Pairwise RM via supervised fine-tuning. Experiments on MATH-500 and the Olympiad Bench demonstrate significant improvements over traditional discriminative reward models, with a 40\% to 60\% relative improvement on the top 50\% most challenging problems.
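The knockout tournament itself is easy to sketch. The following illustrative Python snippet (not the authors' implementation; `pairwise_rm` is an assumed callable that returns the preferred of two candidate solutions) shows how candidates are paired and losers eliminated until one solution remains:

```python
# Illustrative single-elimination (knockout) selection for Best-of-N sampling.
import random

def knockout_best_of_n(problem, candidates, pairwise_rm):
    """Select one candidate via a knockout tournament judged by pairwise_rm."""
    pool = list(candidates)
    random.shuffle(pool)                 # random initial bracket
    while len(pool) > 1:
        next_round = []
        # Compare candidates in pairs; an unpaired candidate gets a bye.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(pairwise_rm(problem, pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Toy usage: a judge that always prefers the longer solution.
toy_rm = lambda problem, a, b: a if len(a) >= len(b) else b
print(knockout_best_of_n("1+1=?", ["2", "2, since 1+1=2", "3"], toy_rm))
```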
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
Liu, Yantao, Yao, Zijun, Min, Rui, Cao, Yixin, Hou, Lei, Li, Juanzi
Reward models are critical in techniques like Reinforcement Learning from Human Feedback (RLHF) and Inference Scaling Laws, where they guide language model alignment and select optimal responses. Despite their importance, existing reward model benchmarks often evaluate models by asking them to distinguish between responses generated by models of varying power. However, this approach fails to assess reward models on subtle but critical content changes and variations in style, resulting in a low correlation with policy model performance. Reward models play a pivotal role in both techniques: in RLHF, they serve as proxies for human values, providing feedback on generated text that helps align language models (policy models) during training (Ouyang et al., 2022; Dong et al., 2024); in Inference Scaling Laws, they are used to select the best response from a set of candidates based on predicted rewards (Wu et al., 2024; Snell et al., 2024). Despite their significance, benchmarks for reward models remain under-explored compared to the rapid advancements in evaluating aligned language models, i.e., policy models (Hendrycks et al., 2020; BIG-bench authors, 2023; Chiang et al., 2024; Hendrycks et al., 2021). To conduct a faithful and systematic evaluation, an ideal benchmark for reward models should adhere to three key principles. 1) Assessing Reward Models' Sensitivity to Subtle Changes: a faithful reward model should distinguish subtle changes and assign a higher reward to the correct response; for example, in Table 1, Response 1 and Response 2 differ by only one word but express completely different meanings, requiring the reward model to focus on content quality. 2) Assessing Reward Models' Robustness to Style Bias: a reward model should not be swayed by superficial attributes such as length or formatting; for example, in Table 1, Response 3 is factually incorrect but longer than Response 1, which could mislead the reward model into assigning Response 3 a higher reward. 3) Correlating with Policy Models: a good reward model benchmark should correlate highly with the performance of the aligned language model (the policy model), making it a reliable proxy for selecting the best reward model for alignment. Recent efforts (Lambert et al., 2024; Zhu et al., 2023; Jiang et al., 2023) have made progress by constructing benchmarks from existing preference datasets.
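As a rough illustration of the kind of evaluation these principles imply (a hypothetical sketch, not RM-Bench's actual schema or metric), a reward model `score_fn` can be scored by whether it ranks the correct response above a subtly wrong one across matched style variants:

```python
# Hypothetical pairwise-accuracy check: the field names and style variants
# below are assumptions for illustration, not the benchmark's real format.

def pairwise_accuracy(examples, score_fn):
    """examples: dicts with 'prompt', 'chosen', 'rejected', where 'chosen' and
    'rejected' map a style name -> response text in that style."""
    wins, total = 0, 0
    for ex in examples:
        for style, chosen in ex["chosen"].items():
            rejected = ex["rejected"][style]      # same style, wrong content
            wins += score_fn(ex["prompt"], chosen) > score_fn(ex["prompt"], rejected)
            total += 1
    return wins / max(total, 1)

# Toy usage: a length-biased "reward model" fails when the wrong answer is shorter
# in one style and the correct answer is shorter in the other.
length_biased = lambda prompt, response: len(response)
data = [{
    "prompt": "Is the Earth flat?",
    "chosen": {
        "plain": "No.",
        "detailed": "No, the Earth is an oblate spheroid, as confirmed by satellite imagery.",
    },
    "rejected": {
        "plain": "Yes.",
        "detailed": "Yes, it is flat.",
    },
}]
print(pairwise_accuracy(data, length_biased))  # 0.5: length bias misleads the model
```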
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Min, Rui, Qin, Zeyu, Zhang, Nevin L., Shen, Li, Cheng, Minhao
Backdoor attacks pose a significant threat to Deep Neural Networks (DNNs), as they allow attackers to manipulate model predictions with backdoor triggers. To address these security vulnerabilities, various backdoor purification methods have been proposed to purify compromised models. Typically, these purified models exhibit low Attack Success Rates (ASR), rendering them resistant to backdoored inputs. However, does achieving a low ASR through current safety purification methods truly eliminate the backdoor features learned in the pretraining phase? In this paper, we show that it does not, by thoroughly investigating the Post-Purification Robustness of current backdoor purification methods. We find that current safety purification methods are vulnerable to the rapid re-learning of backdoor behavior, even when further fine-tuning of purified models is performed using a very small number of poisoned samples. Based on this, we further propose the practical Query-based Reactivation Attack (QRA), which can effectively reactivate the backdoor merely by querying purified models. We find that the failure to achieve satisfactory post-purification robustness stems from the insufficient deviation of purified models from the backdoored model along the backdoor-connected path. To improve post-purification robustness, we propose a straightforward tuning defense, Path-Aware Minimization (PAM), which promotes deviation along backdoor-connected paths with extra model updates. Extensive experiments demonstrate that PAM significantly improves post-purification robustness while maintaining good clean accuracy and a low ASR. Our work offers a new perspective on understanding the effectiveness of backdoor safety tuning and highlights the importance of faithfully assessing a model's safety.
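The re-learning phenomenon described above can be probed with a simple experiment. The following PyTorch-style sketch (illustrative only, not the paper's code; `purified_model`, the data loaders, the target label, and all hyperparameters are assumed) fine-tunes a purified model on a handful of poisoned samples and re-measures the ASR:

```python
# Illustrative post-purification robustness probe: few-shot fine-tuning on
# poisoned samples followed by re-measuring the Attack Success Rate (ASR).
import torch
import torch.nn.functional as F

def attack_success_rate(model, triggered_test_loader, target_label, device="cpu"):
    model.eval()
    hits, total = 0, 0
    with torch.no_grad():
        for x, _ in triggered_test_loader:       # inputs already carry the trigger
            pred = model(x.to(device)).argmax(dim=1)
            hits += (pred == target_label).sum().item()
            total += x.size(0)
    return hits / max(total, 1)

def relearning_probe(purified_model, poisoned_loader, triggered_test_loader,
                     target_label, epochs=5, lr=1e-3, device="cpu"):
    """A robust defense should keep ASR low even after this small fine-tune."""
    model = purified_model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for x, y in poisoned_loader:             # a very small poisoned set
            opt.zero_grad()
            loss = F.cross_entropy(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return attack_success_rate(model, triggered_test_loader, target_label, device)
```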
Towards Stable Backdoor Purification through Feature Shift Tuning
Min, Rui, Qin, Zeyu, Shen, Li, Cheng, Minhao
It has been widely observed that deep neural networks (DNNs) are vulnerable to backdoor attacks, where attackers can maliciously manipulate model behavior by tampering with a small set of training samples. Although a line of defense methods has been proposed to mitigate this threat, they either require complicated modifications to the training process or rely heavily on the specific model architecture, which makes them hard to deploy in real-world applications. Therefore, in this paper, we instead start from fine-tuning, one of the most common and easy-to-deploy backdoor defenses, and conduct comprehensive evaluations against diverse attack scenarios. Initial experiments show that, in contrast to the promising defensive results at high poisoning rates, vanilla tuning methods completely fail in low-poisoning-rate scenarios. Our analysis shows that at low poisoning rates, the entanglement between backdoor and clean features undermines the effect of tuning-based defenses; it is therefore necessary to disentangle the backdoor and clean features to improve backdoor purification. To address this, we introduce Feature Shift Tuning (FST), a method for tuning-based backdoor purification. Specifically, FST encourages feature shifts by actively deviating the classifier weights from the originally compromised weights. Extensive experiments demonstrate that FST provides consistently stable performance under different attack settings. Without complex parameter adjustments, FST also incurs much lower tuning costs, requiring only 10 epochs. Our code is available at https://github.com/AISafety-HKUST/stable_backdoor_purification.
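A minimal sketch of the core idea, assuming a linear classifier head named `model.classifier` and an illustrative cosine-similarity penalty (the exact regularizer and its weight are assumptions, not taken from the paper), could look like this in PyTorch:

```python
# Illustrative tuning step: clean cross-entropy plus a penalty on similarity
# between the current and the originally compromised classifier weights,
# pushing the tuned classifier away from the backdoored solution.
import torch
import torch.nn.functional as F

def feature_shift_tuning_step(model, original_classifier_weight, batch, optimizer,
                              alpha=0.1, device="cpu"):
    """One tuning step on a clean batch; `model` is assumed to already be on `device`."""
    x, y = (t.to(device) for t in batch)
    optimizer.zero_grad()
    ce = F.cross_entropy(model(x), y)
    w = model.classifier.weight                  # assumes a Linear classification head
    similarity = F.cosine_similarity(
        w.flatten(), original_classifier_weight.flatten(), dim=0
    )
    loss = ce + alpha * similarity               # deviate from the compromised weights
    loss.backward()
    optimizer.step()
    return loss.item()
```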