114292cf3f930ba157ed33f66997fee2-Supplemental-Conference.pdf
–Neural Information Processing Systems
Butwellafterconvergence this flips and policy change is relatively high in these states. In Section 3.3 we already touched on this subject with the experiments onCATCH using value iteration. In this section we revisit the question and try to provide a more definite answer to it. Itiswellknownthatlimk Tkπq =qπ foranyq Q. Equipped with the concepts above, we can now present our example. Clearly,inthe first step, when we change fromπ to π1 = g(T1π qπ), the policy changes ins from redto green.
Neural Information Processing Systems
Feb-7-2026, 12:44:35 GMT