Appendix: On the Expressivity of Markov Reward

Apr-25-2026, 14:37:29 GMT–Neural Information Processing Systems

We first address questions that might arise in response to the main text.

Q: If Alice chooses a SOAP, PO, or TO for Bob to learn to solve, when can Alice determine that Bob has solved the task?

A: Bob can be said to be doing better on a given task if his behavior improves, as is typical when evaluating behavior under reward. The difference with SOAPs, POs, and TOs is that we measure improvement relative to the task rather than relative to reward. For instance, given a SOAP, we might say that Bob has solved the task once he has found one of the good policies, and we might measure his progress in terms of the distance from his greedy policy to one of the good policies (as done in our learning experiments). The same reasoning applies to POs and TOs: Bob is doing better on a task insofar as his greedy policy (for a PO) or his trajectories (for a TO) rank higher in the ordering.

Q: The studied reward functions must be a function of s, (s,a), or (s,a,s').

A: Indeed, as discussed in our introduction, our goal is to examine the expressivity of Markov rewards in the context of finite MDPs.
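The distance-based measure of progress on a SOAP described in the first answer can be sketched as follows. This is only an illustration under stated assumptions: policies are deterministic and tabular, the greedy policy is read off a tabular Q-function, and "distance" is taken to be the number of states on which two policies disagree. The text does not specify the metric used in the learning experiments, and the function names here are hypothetical.

```python
from typing import List, Sequence

# A deterministic tabular policy: policy[s] is the action chosen in state s.
Policy = Sequence[int]


def greedy_policy(q: Sequence[Sequence[float]]) -> List[int]:
    """Greedy policy for a tabular Q-function; ties go to the lowest action index."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q]


def policy_distance(pi: Policy, other: Policy) -> int:
    """Assumed metric: number of states on which the two policies disagree."""
    return sum(1 for a, b in zip(pi, other) if a != b)


def distance_to_soap(pi: Policy, soap: Sequence[Policy]) -> int:
    """Distance from pi to the nearest acceptable ("good") policy in the SOAP."""
    return min(policy_distance(pi, good) for good in soap)


def solves_soap(pi: Policy, soap: Sequence[Policy]) -> bool:
    """Bob has solved the task once his policy is one of the good policies."""
    return distance_to_soap(pi, soap) == 0


# Illustrative data only: 3 states, 2 actions.
q_table = [[0.1, 0.9], [0.5, 0.2], [0.0, 0.3]]
soap = [[1, 0, 0], [0, 1, 1]]

bob = greedy_policy(q_table)
progress = distance_to_soap(bob, soap)
solved = solves_soap(bob, soap)
```

Tracking `progress` over the course of learning gives a task-relative measure of improvement: it reaches zero exactly when Bob's greedy policy is one of the good policies.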

artificial intelligence, machine learning, reward function


