A.1 Side-by-side comparison of MDP and tMDP

A temporal MDP process: $(\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, r)$. Probability of a trajectory $\tau$: $p_\pi(\tau) = p_{\text{init}}(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p_{\text{trans}}(s_{t+1} \mid s_t, a_t)$.
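The trajectory probability above factors into the initial-state probability times, at each step, the policy's action probability and the transition probability. A minimal sketch of this computation for a small tabular MDP (all names here, `p_init`, `p_trans`, `policy`, are illustrative, not from the paper):

```python
import numpy as np

# Illustrative two-state, two-action MDP.
p_init = np.array([1.0, 0.0])              # p_init[s]: initial state distribution
p_trans = np.array([[[0.9, 0.1],           # p_trans[s, a, s']: transition kernel
                     [0.2, 0.8]],
                    [[0.5, 0.5],
                     [0.3, 0.7]]])
policy = np.array([[0.6, 0.4],             # policy[s, a]: pi(a | s)
                   [0.1, 0.9]])

def trajectory_prob(states, actions):
    """p_pi(tau) = p_init(s0) * prod_t pi(a_t|s_t) * p_trans(s_{t+1}|s_t, a_t)."""
    p = p_init[states[0]]
    for t, a in enumerate(actions):
        p *= policy[states[t], a] * p_trans[states[t], a, states[t + 1]]
    return p

# Trajectory s0=0 -a=1-> s1=1 -a=0-> s2=0:
# 1.0 * (0.4 * 0.8) * (0.1 * 0.5) = 0.016
print(trajectory_prob([0, 1, 0], [1, 0]))
```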

Neural Information Processing Systems 

This proof draws closely on the proof of the temporal policy gradient theorem. First, it is trivial to show that $\mathrm{GUB}_{ch_i}$ can be derived from $s_i$. Because node $i$ is not a leaf node, it has not resulted in an integral solution, and hence processing node $i$ does not change the GUB. And since $ch_i$ is processed directly after node $i$, we necessarily have $\mathrm{GUB}_{ch_i} = \mathrm{GUB}_i$.

Figure caption: Solid lines show the moving average. The results are averaged over the solving runs that finished successfully for all methods. That is, if a solving run reached the time limit for any method, it is excluded from the average.
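The GUB argument above can be sketched as a small invariant check: only a leaf node with an integral solution can improve the global upper bound, so a child processed directly after a non-leaf node inherits the same GUB. This is a hedged illustration under assumed names (`process_node`, `is_leaf`, `leaf_objective`), not the paper's implementation:

```python
def process_node(gub, is_leaf, leaf_objective=None):
    """Return the global upper bound after processing one node.

    Only a leaf that yields an integral (feasible) solution can
    improve, i.e. lower, the GUB in a minimization problem.
    """
    if is_leaf and leaf_objective is not None:
        return min(gub, leaf_objective)
    return gub  # non-leaf node: the GUB is unchanged

gub_i = 10.0
gub_after_i = process_node(gub_i, is_leaf=False)  # node i is not a leaf
gub_ch_i = gub_after_i                            # ch_i runs directly after i
assert gub_ch_i == gub_i                          # hence GUB_{ch_i} = GUB_i
```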
