Appendix
Neural Information Processing Systems
Problem with selecting oracles based on initial value. Alternatively, we can switch between the oracles once to get a reward of 3/4, or twice to get the optimal reward of 1. Intermediate states give no reward, and all terminal states not shown give a reward of 0. The optimal terminal state is outlined in bold. Consequently, it goes right and eventually obtains a suboptimal reward of 3/4.
May-20-2025, 22:04:44 GMT