Collaborating Authors

 Wray, Kyle


Entropy-regularized Point-based Value Iteration

arXiv.org Artificial Intelligence

Model-based planners for partially observable problems must accommodate both model uncertainty during planning and goal uncertainty during objective inference. However, model-based planners may be brittle under these types of uncertainty because they rely on an exact model and tend to commit to a single optimal behavior. Inspired by results in the model-free setting, we propose an entropy-regularized model-based planner for partially observable problems. Entropy regularization promotes policy robustness for planning and objective inference by encouraging policies to be no more committed to a single action than necessary. We evaluate the robustness and objective inference performance of entropy-regularized policies in three problem domains. Our results show that entropy-regularized policies outperform non-entropy-regularized baselines, achieving higher expected returns under modeling errors and higher accuracy during objective inference.
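To make the entropy-regularization idea concrete, the sketch below contrasts a standard "hard max" backup with a soft (entropy-regularized) backup at a single belief point. It is only a minimal illustration of the mechanism described in the abstract, not the paper's full point-based value iteration with alpha-vectors; the toy action values and the temperature parameter are invented for illustration.

```python
# Minimal sketch: entropy-regularized ("soft") backup vs. standard backup at one
# belief point. The soft backup replaces max_a Q(b,a) with
# V(b) = tau * logsumexp(Q(b,a)/tau), whose induced softmax policy spreads
# probability across near-optimal actions instead of committing to one.
# All numbers here are illustrative; this is not the paper's full PBVI algorithm.

import numpy as np

def soft_backup(q_values, temperature):
    """Entropy-regularized backup: V(b) = tau * logsumexp(Q(b,a) / tau)."""
    tau = temperature
    q = np.asarray(q_values, dtype=float)
    m = q.max()  # numerically stable log-sum-exp
    value = tau * (m / tau + np.log(np.sum(np.exp((q - m) / tau))))
    policy = np.exp((q - value) / tau)  # softmax over actions
    return value, policy / policy.sum()

def hard_backup(q_values):
    """Standard backup: V(b) = max_a Q(b,a) with a deterministic greedy policy."""
    q = np.asarray(q_values, dtype=float)
    policy = np.zeros_like(q)
    policy[q.argmax()] = 1.0
    return q.max(), policy

if __name__ == "__main__":
    # Two actions with nearly identical action values at some belief b.
    q_at_belief = [1.00, 0.98, 0.10]
    v_soft, pi_soft = soft_backup(q_at_belief, temperature=0.1)
    v_hard, pi_hard = hard_backup(q_at_belief)
    print("soft value %.3f, soft policy" % v_soft, np.round(pi_soft, 3))
    print("hard value %.3f, hard policy" % v_hard, pi_hard)
```

With two actions whose values differ only slightly, the hard backup puts all probability on one of them, while the soft backup keeps meaningful probability on both, which is the robustness property the abstract attributes to entropy regularization.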


Decision Making in Non-Stationary Environments with Policy-Augmented Search

arXiv.org Artificial Intelligence

Sequential decision-making under uncertainty is present in many important problems. Two popular approaches for tackling such problems are reinforcement learning and online search (e.g., Monte Carlo tree search). While the former learns a policy by interacting with the environment (typically done before execution), the latter uses a generative model of the environment to sample promising action trajectories at decision time. Decision-making is particularly challenging in non-stationary environments, where the environment in which an agent operates can change over time. Both approaches have shortcomings in such settings -- on the one hand, policies learned before execution become stale when the environment changes and relearning takes both time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo Tree Search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy. We compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q-Learning on several OpenAI Gym environments. Through extensive experiments, we show that under non-stationary settings with limited time constraints, PA-MCTS outperforms these baselines.
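The core idea of combining a stale policy's action-value estimates with estimates from an online search can be sketched as a single hybrid action-selection step. The convex-combination rule and the names (stale_q, search_q, alpha) below are illustrative assumptions, not the paper's exact PA-MCTS formulation, which builds the combination into the tree search itself.

```python
# Hedged sketch: blend action-value estimates from a pre-trained (possibly stale)
# policy with estimates produced by an online search over an up-to-date model,
# then act greedily on the blend. Illustrative only; not the paper's exact rule.

from typing import Dict, Hashable

Action = Hashable

def policy_augmented_action(stale_q: Dict[Action, float],
                            search_q: Dict[Action, float],
                            alpha: float) -> Action:
    """Pick the action maximizing alpha * Q_search(a) + (1 - alpha) * Q_stale(a).

    alpha near 1 trusts the (possibly time-limited) online search;
    alpha near 0 trusts the (possibly out-of-date) learned policy.
    """
    assert 0.0 <= alpha <= 1.0
    combined = {a: alpha * search_q.get(a, 0.0) + (1.0 - alpha) * stale_q.get(a, 0.0)
                for a in set(stale_q) | set(search_q)}
    return max(combined, key=combined.get)

if __name__ == "__main__":
    # Toy example: the environment has shifted, so the stale policy prefers "left",
    # while a short online search already favors "right".
    stale = {"left": 1.0, "right": 0.6}
    search = {"left": 0.4, "right": 0.9}
    print(policy_augmented_action(stale, search, alpha=0.7))  # -> "right"
```

The blend captures the trade-off the abstract describes: the stale policy supplies cheap but potentially outdated estimates, while the search supplies up-to-date but compute-limited estimates.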


Active teacher selection for reinforcement learning from human feedback

arXiv.org Artificial Intelligence

Specifying objective functions for machine learning systems is challenging, and misspecified objectives can be hacked [1, 2] or incentivize degenerate behavior [3, 4, 5]. Techniques such as reinforcement learning from human feedback (RLHF) enable ML systems to instead learn appropriate objectives from human feedback [6, 7, 8]. These techniques are widely used to fine-tune large language models [9, 10, 11, 12] and to train reinforcement learning agents to perform complex maneuvers in continuous control environments [6, 7]. However, while RLHF is relied upon to ensure that these systems are safe, helpful, and harmless [13], it still faces many limitations and unsolved challenges [14]. In particular, RLHF systems typically rely on the assumption that all feedback comes from a single human teacher, even though feedback is in practice gathered from a range of teachers with varying levels of rationality and expertise. For example, Stiennon et al. [8], Bai et al. [13], and Ouyang et al. [15] assume that all feedback comes from a single teacher, but find that annotators and researchers actually disagree 23% to 37% of the time. Reward learning has been shown to be highly sensitive to incorrect assumptions about the process that generates feedback [16, 17, 18, 19], so this single-teacher assumption exposes these systems to dangerous failures [20]. Ideally, RLHF systems should account for the differences between teachers to improve their safety and reliability. To leverage multiple teachers in RLHF, we introduce a novel problem called a Hidden Utility Bandit (HUB), illustrated in Figure 1.
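To illustrate why teacher differences matter, the sketch below simulates pairwise preference feedback from teachers with different rationality levels under a Boltzmann-rational choice model, a common assumption in reward learning. The utilities, rationality coefficients, and teacher names are invented for illustration; this is not the paper's Hidden Utility Bandit formalization.

```python
# Hedged sketch: feedback quality varies with teacher rationality. Under a
# Boltzmann-rational model, P(choose A over B) = sigmoid(beta * (U_A - U_B)),
# so a low-beta teacher's comparisons are far noisier than a high-beta expert's.
# Illustrative toy example only; not the paper's HUB problem.

import math
import random

def prefers_a(utility_a: float, utility_b: float, beta: float) -> bool:
    """Sample one pairwise comparison from a Boltzmann-rational teacher."""
    p_a = 1.0 / (1.0 + math.exp(-beta * (utility_a - utility_b)))
    return random.random() < p_a

if __name__ == "__main__":
    random.seed(0)
    u_a, u_b = 1.0, 0.8                                   # hidden true utilities
    teachers = {"expert": 10.0, "novice": 1.0, "random": 0.0}  # rationality beta
    for name, beta in teachers.items():
        votes = sum(prefers_a(u_a, u_b, beta) for _ in range(1000))
        print(f"{name:>6}: prefers A in {votes / 10:.1f}% of 1000 comparisons")
```

A reward learner that treats all of this feedback as coming from one teacher conflates the expert's reliable signal with the near-random teacher's noise, which is the failure mode the single-teacher assumption exposes.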