72da7fd6d1302c0a159f6436d01e9eb0-AuthorFeedback.pdf

Neural Information Processing Systems 

DPO is an off-policy actor-critic framework which requires that all state-action pairs are visited "enough" in order to ensure convergence, which is a theoretical assumption in various off-policy algorithms. This form of learning in itself is often unstable (resulting in mode-collapse) and still lacks theoretical guarantees and stability assurances. The former is what classical algorithms such as CPI (Kakade and Langford 2002) require, whereas the latter is what occurs in the standard policy gradient approaches. Gaussian or Delta distributions are limited to their set, and thus can't ensure convergence to a global extremum.
