Appendix AMDPExamplesforSection3
–Neural Information Processing Systems
Inthestates0,thepolicyπ selects the oracle with largest value ins0 and goes left. It subsequently selects the right oracle in s1 and left ins4 to get the optimal reward.πmax On the other hand, if we swap the rewards ofs7 and s11, then π chooses the right action ins0 and gets a suboptimal reward. Several prior works proposed empirical approaches to IL settings with multiple oracles. InfoGAIL tries to recover this oracle mixture.
Neural Information Processing Systems
Feb-8-2026, 03:56:24 GMT