pbo
Efficient Reinforcement Learning from Human Feedback via Bayesian Preference Inference
Cercola, Matteo, Capretti, Valeria, Formentin, Simone
Learning from human preferences is a cornerstone of aligning machine learning models with subjective human judgments. Yet, collecting such preference data is often costly and time-consuming, motivating the need for more efficient learning paradigms. Two established approaches offer complementary advantages: RLHF scales effectively to high-dimensional tasks such as LLM fine-tuning, while PBO achieves greater sample efficiency through active querying. We propose a hybrid framework that unifies RLHF's scalability with PBO's query efficiency by integrating an acquisition-driven module into the RLHF pipeline, thereby enabling active and sample-efficient preference gathering. We validate the proposed approach on two representative domains: (i) high-dimensional preference optimization and (ii) LLM fine-tuning. Experimental results demonstrate consistent improvements in both sample efficiency and overall performance across these tasks.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada > British Columbia (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
Exploiting Prior Knowledge in Preferential Learning of Individualized Autonomous Vehicle Driving Styles
Theiner, Lukas, Hirt, Sebastian, Steinke, Alexander, Findeisen, Rolf
Trajectory planning for automated vehicles commonly employs optimization over a moving horizon - Model Predictive Control - where the cost function critically influences the resulting driving style. However, finding a suitable cost function that results in a driving style preferred by passengers remains an ongoing challenge. We employ preferential Bayesian optimization to learn the cost function by iteratively querying a passenger's preference. Due to increasing dimensionality of the parameter space, preference learning approaches might struggle to find a suitable optimum with a limited number of experiments and expose the passenger to discomfort when exploring the parameter space. We address these challenges by incorporating prior knowledge into the preferential Bayesian optimization framework. Our method constructs a virtual decision maker from real-world human driving data to guide parameter sampling. In a simulation experiment, we achieve faster convergence of the prior-knowledge-informed learning procedure compared to existing preferential Bayesian optimization approaches and reduce the number of inadequate driving styles sampled.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Transportation > Ground > Road (1.00)
- Automobiles & Trucks (1.00)
Preferential Multi-Objective Bayesian Optimization
Astudillo, Raul, Li, Kejun, Tucker, Maegan, Cheng, Chu Xin, Ames, Aaron D., Yue, Yisong
Preferential Bayesian optimization (PBO) is a framework for optimizing a decision-maker's latent preferences over available design choices. While preferences often involve multiple conflicting objectives, existing work in PBO assumes that preferences can be encoded by a single objective function. For example, in robotic assistive devices, technicians often attempt to maximize user comfort while simultaneously minimizing mechanical energy consumption for longer battery life. Similarly, in autonomous driving policy design, decision-makers wish to understand the trade-offs between multiple safety and performance attributes before committing to a policy. To address this gap, we propose the first framework for PBO with multiple objectives. Within this framework, we present dueling scalarized Thompson sampling (DSTS), a multi-objective generalization of the popular dueling Thompson algorithm, which may be of interest beyond the PBO setting. We evaluate DSTS across four synthetic test functions and two simulated exoskeleton personalization and driving policy design tasks, showing that it outperforms several benchmarks. Finally, we prove that DSTS is asymptotically consistent. As a direct consequence, this result provides, to our knowledge, the first convergence guarantee for dueling Thompson sampling in the PBO setting.
- North America > United States > California (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.69)
Parameterized Projected Bellman Operator
Vincent, Théo, Metelli, Alberto Maria, Belousov, Boris, Peters, Jan, Restelli, Marcello, D'Eramo, Carlo
Approximate value iteration~(AVI) is a family of algorithms for reinforcement learning~(RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterated procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space. Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
- Europe > Germany > Bavaria > Lower Franconia > Würzburg (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- (2 more...)
Preferential Bayesian optimisation with Skew Gaussian Processes
Benavoli, Alessio, Azzimonti, Dario, Piga, Dario
Bayesian optimisation (BO) is a very effective approach for sequential black-box optimization where direct queries of the objective function are expensive. However, there are cases where the objective function can only be accessed via preference judgments, such as "this is better than that" between two candidate solutions (like in A/B tests or recommender systems). The state-of-the-art approach to Preferential Bayesian Optimization (PBO) uses a Gaussian process to model the preference function and a Bernoulli likelihood to model the observed pairwise comparisons. Laplace's method is then employed to compute posterior inferences and, in particular, to build an appropriate acquisition function. In this paper, we prove that the true posterior distribution of the preference function is a Skew Gaussian Process (SkewGP), with highly skewed pairwise marginals and, thus, show that Laplace's method usually provides a very poor approximation. We then derive an efficient method to compute the exact SkewGP posterior and use it as surrogate model for PBO employing standard acquisition functions (Upper Credible Bound, etc.). We illustrate the benefits of our exact PBO-SkewGP in a variety of experiments, by showing that it consistently outperforms PBO based on Laplace's approximation both in terms of convergence speed and computational time. We also show that our framework can be extended to deal with mixed preferential-categorical BO, typical for instance in smart manufacturing, where binary judgments (valid or non-valid) together with preference judgments are available.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- (5 more...)