bpo
Preference Optimization by Estimating the Ratio of the Data Distribution
Kim, Yeongmin, Bae, Heesun, Na, Byeonghu, Moon, Il-Chul
Direct preference optimization (DPO) is widely used as a simple and stable method for aligning large language models (LLMs) with human preferences. This paper investigates a generalized DPO loss that enables a policy model to match the target policy from a likelihood ratio estimation perspective. The ratio of the target policy provides a unique identification of the policy distribution without relying on reward models or partition functions. This allows the generalized loss to retain both simplicity and theoretical guarantees, which prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman preference optimization (BPO), a generalized framework for ratio matching that provides a family of objective functions achieving target policy optimality. BPO subsumes DPO as a special case and offers tractable forms for all instances, allowing implementation with a few lines of code. We further develop scaled Basu's power divergence (SBA), a gradient scaling method that can be used for BPO instances. The BPO framework complements other DPO variants and is applicable to target policies defined by these variants. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a trade-off between generation fidelity and diversity, instances of BPO improve both win rate and entropy compared with DPO. When applied to Llama-3-8B-Instruct, BPO achieves state-of-the-art performance among Llama-3-8B backbones, with a 55.9\% length-controlled win rate on AlpacaEval2. Project page: https://github.com/aailab-kaist/BPO.
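As background for the claim that BPO subsumes DPO as a special case, the standard DPO objective can be sketched in a few lines. This is a minimal illustration of the well-known DPO loss, not the paper's BPO instances; the function name and toy log-probabilities are assumptions for the example:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    logp_* are the policy's log-probabilities of the chosen/rejected
    responses; ref_logp_* come from the frozen reference model. beta
    scales the implicit reward, i.e. the log-likelihood ratio to the
    reference -- the same ratio BPO matches directly.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the reward margin (Bradley-Terry likelihood).
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A BPO instance would replace the log-sigmoid link with another member of the Bregman-divergence family while keeping this pairwise, partition-function-free form.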
BPO: Revisiting Preference Modeling in Direct Preference Optimization
Sun, Lin, Liu, Chuang, Liu, Peng, Li, Bingyang, Lu, Weijia, Wu, Ning
Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address it, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: a balanced reward margin and a gap adaptor. Unlike previous methods, BPO fundamentally resolves DPO's DCR issue without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
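The DCR observation rests on a basic property of DPO's pairwise loss: it depends only on the reward margin, so shifting both implicit rewards down by the same amount — which lowers the chosen response's likelihood — leaves the loss unchanged. A minimal numeric illustration (the helper name is hypothetical, and this shows only the diagnosed property, not the paper's fix):

```python
import math

def pairwise_dpo_loss(r_chosen, r_rejected):
    # r_* are the implicit rewards beta * log(pi/pi_ref) for each response.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Shifting both rewards down by the same constant leaves the loss
# unchanged, even though the chosen response's likelihood has dropped:
base = pairwise_dpo_loss(1.0, 0.5)
shifted = pairwise_dpo_loss(1.0 - 3.0, 0.5 - 3.0)
```

Because the loss sees only the margin, nothing in the objective itself penalizes this degradation of the chosen response.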
BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment
Wang, Sizhe, Tong, Yongqi, Zhang, Hengyuan, Li, Dawei, Zhang, Xin, Chen, Tianlong
Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model's optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.
BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Xu, Wenda, Li, Jiachen, Wang, William Yang, Li, Lei
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human reference text.
Do Transformer World Models Give Better Policy Gradients?
Ma, Michel, Ni, Tianwei, Gehring, Clement, D'Oro, Pierluca, Bacon, Pierre-Luc
A natural approach for reinforcement learning is to predict future rewards by unrolling a neural network world model, and to backpropagate through the resulting computational graph to learn a policy. However, this method often becomes impractical for long horizons since typical world models induce hard-to-optimize loss landscapes. Transformers are known to efficiently propagate gradients over long horizons: could they be the solution to this problem? Surprisingly, we show that commonly-used transformer world models produce circuitous gradient paths, which can be detrimental to long-range policy gradients. To tackle this challenge, we propose a class of world models called Actions World Models (AWMs), designed to provide more direct routes for gradient propagation. We integrate such AWMs into a policy gradient framework that underscores the relationship between network architectures and the policy gradient updates they inherently represent. We demonstrate that AWMs can generate optimization landscapes that are easier to navigate even when compared to those from the simulator itself. This property allows transformer AWMs to produce better policies than competitive baselines in realistic long-horizon tasks.
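The "natural approach" the abstract describes — unroll a world model, sum predicted rewards, and differentiate the rollout with respect to the policy parameters — can be shown on a toy 1-D linear system. This is an illustrative sketch of backpropagation through an unrolled model (finite differences stand in for autodiff), not the paper's AWM architecture; all names and constants are assumptions:

```python
# Toy world model: s' = s + u, reward -s^2, linear policy u = k * s.
# The return of an H-step rollout is a differentiable function of the
# policy gain k, so we can follow its gradient, which is exactly what
# backprop through the unrolled computational graph would compute.

def unrolled_return(k, s0=1.0, horizon=5):
    s, ret = s0, 0.0
    for _ in range(horizon):
        u = k * s            # policy action
        s = s + u            # world-model transition
        ret += -s * s        # predicted reward after the step
    return ret

def grad_k(k, eps=1e-6):
    # Central finite difference in place of autodiff through the rollout.
    return (unrolled_return(k + eps) - unrolled_return(k - eps)) / (2 * eps)

k = 0.0
for _ in range(500):
    k += 0.01 * grad_k(k)    # gradient ascent on the model-predicted return
```

Here k converges toward -1, the gain that drives the state to zero. Even in this toy, a larger step size makes the unrolled objective blow up — a small-scale hint at the hard-to-optimize landscapes the abstract attributes to long-horizon rollouts.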
How to Split a Logic Program
Answer Set Programming (ASP) is a successful method for solving a range of real-world applications. Despite the availability of fast ASP solvers, computing answer sets demands very large computational power, since the problem tackled lies at the second level of the polynomial hierarchy. A speed-up in answer set computation may be attained if the program can be split into two disjoint parts, bottom and top: the bottom part is evaluated independently of the top part, and the results of the bottom evaluation are used to simplify the top part. Lifschitz and Turner introduced the concept of a splitting set, i.e., a set of atoms that defines the splitting. In this paper, we show that the problem of computing a splitting set with some desirable properties can be reduced to a classic search problem and solved in polynomial time. This allows us to conduct experiments on the size of the splitting set in various programs and leads to an interesting discovery of a source of complication in stable model computation. We also show that for Head-Cycle-Free programs, the definition of splitting sets can be adjusted to allow splitting of a broader class of programs.
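The Lifschitz-Turner condition the abstract builds on is simple to state: a set of atoms U splits a program if every rule whose head belongs to U mentions only atoms from U, so the bottom part can be evaluated on its own. A minimal sketch of that check (the rule encoding is an assumption for illustration, not the paper's reduction to a search problem):

```python
def is_splitting_set(u, rules):
    """Lifschitz-Turner splitting-set check for a normal logic program.

    rules: list of (head, body_atoms) pairs, where body_atoms contains
    the atoms of both positive and negated body literals. U is a
    splitting set if every rule with its head in U draws all of its
    atoms from U.
    """
    return all(head not in u or set(body) <= u for head, body in rules)

# Program: a.  b :- a.  c :- b, not d.  d :- not c.
rules = [("a", []), ("b", ["a"]), ("c", ["b", "d"]), ("d", ["c"])]
```

For this program, {a, b} is a splitting set (its rules only mention a and b), while {a, b, c} is not, because the rule for c also mentions d.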
Applying Digital Tools to Outsourced Services
In my line of work, I'm often asked what the difference is between a business process outsourcer (BPO) and a knowledge process outsourcer (KPO). So much of that comes down to technology. One key distinction is that BPOs focus more on simple automation of repetitive tasks. Some BPOs rely on human capital to perform these tasks, while others invest in robotic process automation (RPA) to speed things up. KPOs differ from BPOs in that they provide their customer base with much deeper domain expertise and a much higher-value level of analysis.
Emerging technologies as a business opportunity for the Kosovo IT & BPO outsourcing companies
Today's rapid pace of technological change has fundamentally transformed the global outsourcing scene, making outsourcing part of every successful company's strategy. Properly developed, strategic outsourcing substantially lowers costs, risks, and fixed investments while greatly expanding flexibility, innovative capabilities, and opportunities for creating higher value-added and shareholder returns. Traditionally, the main driving factor behind IT & BPO outsourcing was cost reduction. But lately, beyond cost reduction, global companies outsource to access knowledge, talent, innovation, and expertise that is available and ready to be put into use. It's a knowledge economy, and in a knowledge economy global companies gain access to global capabilities and knowledge with the aim of staying current, innovating, or transforming their companies.
BPOs must innovate to tackle artificial intelligence threat - The Manila Times Online
ARTIFICIAL intelligence (AI) could wipe out thousands of jobs in the country's fast-growing business process outsourcing (BPO) sector, but the industry can address this threat through innovation and training, Trade Secretary Ramon Lopez said on Wednesday. "One solution to this technology's threat is to modify and remodel the existing jobs' nature. Another way is the organization of groups necessary in the infusion of science and technology to other fields or areas for the realization of establishing a comprehensive network," Lopez said. The most important aspect, he said, is to conduct training and related activities. "In simpler terms, we will endeavor to make this new technology work for them," Lopez said.