Reward is Enough for Convex MDPs

Neural Information Processing Systems

Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many