A.1 Side-by-side comparison of MDP and tMDP

A temporal MDP process: $(\mathcal{S}, \mathcal{A}, p_{\text{init}}, p_{\text{trans}}, r)$.

Probability of a trajectory $\tau$: $p_\pi(\tau) = p_{\text{init}}(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, p_{\text{trans}}(s_{t+1} \mid s_t, a_t)$.
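The trajectory probability above factorizes into the initial-state probability times, per step, the policy's action probability and the transition probability. A minimal numeric sketch, using a hypothetical 2-state, 2-action MDP (the distributions below are made up purely for illustration):

```python
import numpy as np

# Hypothetical MDP components (illustrative values, not from the paper).
p_init = np.array([0.8, 0.2])                # p_init(s)
p_trans = np.array([                         # p_trans(s' | s, a), indexed [s, a, s']
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.3, 0.7], [0.5, 0.5]],
])
policy = np.array([[0.7, 0.3], [0.2, 0.8]])  # pi(a | s), indexed [s, a]

def trajectory_prob(states, actions):
    """p_pi(tau) = p_init(s0) * prod_t pi(a_t|s_t) * p_trans(s_{t+1}|s_t, a_t)."""
    p = p_init[states[0]]
    for t, a in enumerate(actions):
        p *= policy[states[t], a] * p_trans[states[t], a, states[t + 1]]
    return p

# tau = (s0=0, a0=1, s1=1, a1=0, s2=0)
print(trajectory_prob([0, 1, 0], [1, 0]))  # 0.8 * (0.3*0.6) * (0.2*0.3) = 0.00864
```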
Neural Information Processing Systems
This proof closely follows the proof of the temporal policy gradient theorem. First, it is trivial to show that $\text{GUB}_{ch_i}$ can be derived from $s_i$. Because node $i$ is not a leaf node, it has not resulted in an integral solution, and hence processing node $i$ does not change the GUB. And since $ch_i$ is processed directly after node $i$, we necessarily have $\text{GUB}_{ch_i} = \text{GUB}_i$.

[Figure caption] Solid lines show the moving average. The results are averaged over the solving runs that finished successfully for all methods. That is, if a solving run reached the time limit for any method, it is excluded from the average.
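The argument above rests on a single invariant of branch-and-bound: the global upper bound (GUB) can only change when a node produces an integral (feasible) solution, i.e. at a leaf. A minimal sketch of that invariant, with a hypothetical `Node`/`process` model (not the paper's implementation; minimization convention assumed):

```python
# Hypothetical branch-and-bound fragment, for illustration only:
# the GUB is updated only when a node yields an integral solution.
# A non-leaf node i therefore leaves the GUB unchanged, so its child
# ch_i, processed directly after i, observes GUB_{ch_i} == GUB_i.

class Node:
    def __init__(self, obj, integral):
        self.obj = obj            # objective value of the node's LP relaxation
        self.integral = integral  # True iff the LP solution is integral (leaf)

def process(node, gub):
    """Return the GUB after processing `node` (minimization)."""
    if node.integral:             # leaf: a feasible solution may improve the GUB
        return min(gub, node.obj)
    return gub                    # non-leaf: the GUB is untouched

gub_i = 10.0
node_i = Node(obj=7.5, integral=False)  # node i is not a leaf
gub_ch_i = process(node_i, gub_i)       # ch_i is processed right after i
assert gub_ch_i == gub_i                # GUB_{ch_i} = GUB_i
```

Had `node_i` been a leaf with an integral solution of value 7.5, the child would instead see the improved bound `min(10.0, 7.5) = 7.5`.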