
Bellman equation





Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation

Neural Information Processing Systems

This paper is about the problem of learning a stochastic policy for generating an object (like a molecular graph) from a sequence of actions, such that the probability of generating an object is proportional to a given positive reward for that object.
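To make the stated objective concrete: on a tree-structured action space, where each object is reached by a unique action sequence, flows can be computed exactly by summing rewards bottom-up, and choosing children in proportion to their flows samples each object with probability proportional to its reward. The sketch below is a toy construction under that tree assumption, not the paper's learned model; the `children` and `reward` tables are invented for the example.

```python
import random

# Toy tree-structured generation space: internal nodes are partial objects,
# leaves are complete objects carrying a positive reward.
children = {
    "root": ["A", "B"],
    "A": ["AA", "AB"],
    "B": ["BA", "BB"],
}
reward = {"AA": 1.0, "AB": 2.0, "BA": 3.0, "BB": 4.0}

def flow(node):
    """Flow through a node: its reward at a leaf, the sum of child flows otherwise."""
    if node in reward:
        return reward[node]
    return sum(flow(c) for c in children[node])

def sample():
    """Generate an object by picking each action in proportion to child flow."""
    node = "root"
    while node not in reward:
        kids = children[node]
        node = random.choices(kids, weights=[flow(k) for k in kids])[0]
    return node

# The child-flow ratios telescope, so P(x) = reward(x) / Z with Z = 10 here.
counts = {}
for _ in range(100_000):
    x = sample()
    counts[x] = counts.get(x, 0) + 1
print({x: round(c / 100_000, 3) for x, c in sorted(counts.items())})
```

On general DAGs, where several action sequences can produce the same object, exact bottom-up summation is no longer available, which is why the paper instead learns approximate flows by enforcing a flow-matching condition at every state.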




Appendix: Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

Neural Information Processing Systems

Thus the optimal average reward of the original MDP and the modified MDP differ by O(ϵ). To ensure Assumption 3.1(b) is satisfied, an aperiodicity transformation can be applied. The proof of this theorem can be found in [Sch71]. From Lemma 2.2, we thus have the corresponding bound on J. In order to iterate Equation (8), we need to ensure the terms are non-negative. Theorem 3.3 presents an upper bound on the error in terms of the average reward.
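As a minimal illustration of the aperiodicity transformation referenced above, assuming the standard construction P_τ = (1 − τ)I + τP with rewards left unchanged (the two-state chain and the value of τ are invented for the example), the snippet below checks that the transformation preserves the stationary distribution, and hence a fixed policy's average reward, while making a periodic chain aperiodic.

```python
import numpy as np

# A period-2 chain under some fixed policy, with per-state rewards.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([1.0, 3.0])

def stationary(P):
    """Stationary distribution: the eigenvector of P^T for eigenvalue 1, normalized."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

tau = 0.5  # aperiodicity parameter in (0, 1)
P_tau = (1 - tau) * np.eye(2) + tau * P  # self-loops break periodicity

for name, M in [("original", P), ("transformed", P_tau)]:
    pi = stationary(M)
    print(name, "stationary:", pi, "average reward:", pi @ r)
```

Since π P_τ = (1 − τ)π + τπP = π whenever πP = π, both chains report the same stationary distribution and the same average reward (2.0 here).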




Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning

Neural Information Processing Systems

We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure. Although existing works have established non-asymptotic regret guarantees for this problem, they leave open an exponential gap between the upper and lower bounds. We identify the deficiencies in existing algorithms and their analysis that result in such a gap. To remedy these deficiencies, we investigate a simple transformation of the risk-sensitive Bellman equations, which we call the exponential Bellman equation.
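The identity behind this transformation can be checked numerically. For the entropic risk measure with parameter β > 0, the risk-sensitive backup V_h(s) = max_a { r(s,a) + (1/β) log E[exp(β V_{h+1}(s'))] } is equivalent, after the substitution G = exp(βV), to a multiplicative backup on G. The sketch below is a toy tabular check under invented random dynamics, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, beta = 3, 2, 4, 0.7  # small tabular MDP; beta > 0 here

P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

# Standard risk-sensitive backup on V.
V = np.zeros(S)
for _ in range(H):
    Q = r + (1.0 / beta) * np.log(np.einsum("sat,t->sa", P, np.exp(beta * V)))
    V = Q.max(axis=1)

# Exponential Bellman backup on G = exp(beta * V):
# G_h(s) = max_a exp(beta * r(s,a)) * E[G_{h+1}(s')].
G = np.ones(S)  # exp(beta * 0)
for _ in range(H):
    G = (np.exp(beta * r) * np.einsum("sat,t->sa", P, G)).max(axis=1)

print(np.allclose(V, np.log(G) / beta))  # True: the two recursions agree
```

For β < 0 (the risk-averse side), exp is decreasing, so the max over actions on V corresponds to a min on G; the identity otherwise carries over.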