alpacafarm
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.31)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > Germany > Berlin (0.04)
- Asia > Middle East > Jordan (0.04)
- Leisure & Entertainment (0.46)
- Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.93)
- (4 more...)
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well.Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these bottlenecks with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM based simulator for human feedback that is 45x cheaper than crowdworkers and displays high agreement with humans. Second, we identify an evaluation dataset representative of real-world instructions and propose an automatic evaluation procedure. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, among others) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% win-rate improvement against Davinci003.
A Reference LPF methods on AlpacaFarm 568 Having defined and validated the pairwise feedback simulator and evaluations in AlpacaFarm, we
A.1 Methods that directly learn from pairwise feedback To start, we describe the step of training the surrogate reward model. We adapt this approach in AlpacaFarm as a two-step method. In Appendix F, we include our preliminary study of multi-round expert iteration. We find exactly this result with the simulator. Figure 5: Our simulated annotators are cheap and match well with human annotators.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.47)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.31)
- North America > Canada > Ontario > Toronto (0.14)
- Europe > France (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (2 more...)
- Government (0.67)
- Leisure & Entertainment (0.67)
- Education (0.67)
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well.Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these bottlenecks with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM based simulator for human feedback that is 45x cheaper than crowdworkers and displays high agreement with humans. Second, we identify an evaluation dataset representative of real-world instructions and propose an automatic evaluation procedure.
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their ability to follow user instructions well.Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process faces three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these bottlenecks with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM based simulator for human feedback that is 45x cheaper than crowdworkers and displays high agreement with humans. Second, we identify an evaluation dataset representative of real-world instructions and propose an automatic evaluation procedure.
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Qiu, Jiahao, Lu, Yifu, Zeng, Yifan, Guo, Jiacheng, Geng, Jiayi, Wang, Huazheng, Huang, Kaixuan, Wu, Yue, Wang, Mengdi
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves the highest win rate of 65% on TutorEval and around 60% win rates across other different datasets, outperforming standard BoN with the same computational cost and showcasing its scalability and alignment efficacy.
- North America > United States > Oregon (0.04)
- North America > United States > Michigan (0.04)
- Europe > Spain > Aragón (0.04)
- Asia > Middle East > Jordan (0.04)
Fine-Tuning Language Models with Reward Learning on Policy
Lang, Hao, Huang, Fei, Li, Yongbin
Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences. RLHF contains three steps, i.e., human preference collecting, reward learning, and policy optimization, which are usually performed serially. Despite its popularity, however, (fixed) reward models may suffer from inaccurate off-distribution, since policy optimization continuously shifts LLMs' data distribution. Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, which unfortunately makes the resulting system more complicated and difficult to optimize. In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution. Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples. Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs. Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art. Our code is available at \url{https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/rlp}.
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Dubois, Yann, Li, Xuechen, Taori, Rohan, Zhang, Tianyi, Gulrajani, Ishaan, Ba, Jimmy, Guestrin, Carlos, Liang, Percy, Hashimoto, Tatsunori B.
Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following requires tackling three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 50x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that the rankings of models trained in AlpacaFarm match the rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003.
- North America > Canada > Ontario > Toronto (0.14)
- Europe > France (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- (2 more...)
- Government (0.67)
- Leisure & Entertainment (0.67)
- Education (0.45)