Example where multi-step outperforms one-step
Neural Information Processing Systems
As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. The data-generating and learning processes are exactly the same (100 trajectories of length 100, discount 0.9, α = 0.1 for reverse KL regularization). The only difference is that rather than using a behavior that is a mixture of optimal and uniform, we use a behavior that is a mixture of maximally suboptimal and uniform. If we call the suboptimal policy π (which always goes down and left in our gridworld), then the behavior for the modified example is β = 0.2 π + 0.8 u, where u is uniform. Results are shown in Figure 7.

Figure 7: A gridworld example with modified behavior where multi-step is much better than one-step.
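To make the behavior mixture β = 0.2 π + 0.8 u concrete, the following is a minimal sketch of the action distribution at a single state. The four-action set, the equal split of π between "down" and "left", and all function names are assumptions made for illustration; the text above does not specify how π chooses between the two directions.

```python
import random

# Assumed action set for a standard gridworld (not given in the text).
ACTIONS = ["up", "down", "left", "right"]


def suboptimal_policy():
    """Hypothetical pi: 'always goes down and left'.

    Assumption: probability is split equally between 'down' and 'left'.
    """
    return {"up": 0.0, "down": 0.5, "left": 0.5, "right": 0.0}


def uniform_policy():
    """u: equal probability on every action."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}


def mix(pi, u, weight=0.2):
    """Behavior beta = weight * pi + (1 - weight) * u, action by action."""
    return {a: weight * pi[a] + (1.0 - weight) * u[a] for a in ACTIONS}


def sample_action(policy, rng):
    """Draw one action from a policy given as a dict of probabilities."""
    weights = [policy[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


beta = mix(suboptimal_policy(), uniform_policy())

# Example draw from the behavior policy with a fixed seed.
rng = random.Random(0)
action = sample_action(beta, rng)
```

Under these assumptions, β places probability 0.3 on each of "down" and "left" and 0.2 on each of "up" and "right", so the logged data still covers every action but is tilted toward the suboptimal directions.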