






RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning

Neural Information Processing Systems

Reinforcement Learning from Human Feedback (RLHF) has recently surged in popularity, particularly for aligning large language models and other AI systems with human intentions. At its core, RLHF can be viewed as a specialized instance of Preference-based Reinforcement Learning (PbRL), where the preferences specifically originate from human judgments rather than arbitrary evaluators. Despite this connection, most existing approaches in both RLHF and PbRL primarily focus on optimizing a mean reward objective, neglecting scenarios that necessitate risk-awareness, such as AI safety, healthcare, and autonomous driving. These scenarios often operate under a one-episode-reward setting, which makes conventional risk-sensitive objectives inapplicable.
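To make the distinction between a mean-reward objective and a risk-aware one concrete, here is a minimal, illustrative sketch (not taken from the paper): it contrasts the average episode return with a Conditional Value-at-Risk (CVaR) style objective over per-episode returns. The function names and the alpha parameter are assumptions chosen for illustration only.

```python
import numpy as np

def mean_objective(episode_returns):
    # risk-neutral objective: average return across episodes
    return float(np.mean(episode_returns))

def cvar_objective(episode_returns, alpha=0.1):
    # risk-aware objective (illustrative): mean of the worst alpha-fraction of episode returns
    returns = np.sort(np.asarray(episode_returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return float(returns[:k].mean())

# two policies with the same mean return but very different tail risk
policy_a = [1.0, 1.0, 1.0, 1.0]    # steady returns
policy_b = [2.0, 2.0, 2.0, -2.0]   # occasional catastrophic episode
print(mean_objective(policy_a), mean_objective(policy_b))   # 1.0 vs 1.0
print(cvar_objective(policy_a), cvar_objective(policy_b))   # 1.0 vs -2.0
```

Under a mean objective the two policies look identical, while the risk-aware objective penalizes the policy with the catastrophic tail, which is the behavior one wants in safety-critical settings such as healthcare or autonomous driving.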





e92381dba235a8309f08ce46376189a9-Supplemental-Conference.pdf

Neural Information Processing Systems

We use the symmetrized cosine similarity loss from SimSiam.

Model details. For CIFAR10, we use the pretrained StyleGAN available at the official website of StyleGAN-Ada [31]. We also experimented with the model with the best Inception score but did not observe a significant difference in results.

Linear classification. The quality of the pretrained representations is evaluated by training a supervised linear classifier on frozen representations on the training set, and then testing it on the validation set.
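For reference, below is a minimal PyTorch-style sketch of the symmetrized negative cosine similarity loss from SimSiam. The tensor names (p1, p2 for predictor outputs, z1, z2 for projector outputs of the two augmented views) are illustrative assumptions, not identifiers from this paper's code.

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    # negative cosine similarity with stop-gradient on the target branch, as in SimSiam
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def symmetrized_simsiam_loss(p1, z1, p2, z2):
    # p1, p2: predictor outputs for the two augmented views
    # z1, z2: projector outputs for the two augmented views
    # symmetrize by averaging the loss over both view orderings
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```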