Goto

Collaborating Authors

 average reward


Direct Alignment with Heterogeneous Preferences

Neural Information Processing Systems

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning to heterogeneous preferences with a single policy is best achieved using the average reward across user types. However, this requires additional information about annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, however, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.


Bandits attack function optimization

arXiv.org Machine Learning

We consider function optimization as a sequential decision making problem under budget constraint. This constraint limits the number of objective function evaluations allowed during the optimization. We consider an algorithm inspired by a continuous version of a multi-armed bandit problem which attacks this optimization problem by solving the tradeoff between exploration (initial quasi-uniform search of the domain) and exploitation (local optimization around the potentially global maxima). We introduce the so-called Simultaneous Optimistic Optimization (SOO), a deterministic algorithm that works by domain partitioning. The benefit of such approach are the guarantees on the returned solution and the numerical efficiency of the algorithm. We present this machine learning approach to optimization, and provide the empirical assessment of SOO on the CEC'2014 competition on single objective real-parameter numerical optimization test-suite.







Appendix: Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Neural Information Processing Systems

Thus the optimal average reward of the original MDP and modified MDP differ by O ( ฯต). To ensure Assumption 3.1 (b) is satisfied, an aperiodicity transformation can be implemented. The proof of this theorem can be found in [Sch71]. From Lemma 2.2, we thus have, ( J In order to iterate Equation (8), need to ensure the terms are non-negative. Theorem 3.3 presents an upper bound on the error in terms of the average reward.



R-learninginactor-criticmodeloffersabiologically relevantmechanismforsequentialdecision-making

Neural Information Processing Systems

Afewstudies haveexplored sequential stay-or-leavedecisions in humans, or rodents - the model organism used to access neuronal activity at high resolution. In both cases, decision patterns were collected inforaging tasks-the experimental settings where subjects decide when to leave depleting resources (2).