A general Markov decision process formalism for action-state entropy-regularized reward maximization