Appendix A MDP Examples for Section 3

Feb-8-2026, 03:56:24 GMT–Neural Information Processing Systems

In the state s0, the policy π selects the oracle with the largest value in s0 and goes left. It subsequently selects the right oracle in s1 and left in s4 to get the optimal reward. On the other hand, if we swap the rewards of s7 and s11, then π chooses the right action in s0 and gets a suboptimal reward.

Several prior works proposed empirical approaches to imitation learning (IL) settings with multiple oracles. InfoGAIL tries to recover this oracle mixture.
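The behaviour described in the MDP example can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the MDP is taken to be a depth-3 binary tree in which state i has children 2i+1 (left) and 2i+2 (right), there are two hypothetical oracles ("always left" and "always right"), and the leaf rewards are invented so that the walkthrough above is reproduced. The policy picks, in each state, the oracle with the largest value there and follows its action.

```python
# Hypothetical setup (not from the paper): depth-3 binary tree,
# state i has children 2*i + 1 (left) and 2*i + 2 (right); leaves are s7..s14.
LEFT, RIGHT = 0, 1


def make_rewards(swap_s7_s11=False):
    """Invented leaf rewards; optionally swap the rewards of s7 and s11."""
    rewards = {s: 0.0 for s in range(7, 15)}
    rewards.update({7: 0.5, 9: 1.0, 10: 0.6, 11: 0.1, 14: 0.4})
    if swap_s7_s11:
        rewards[7], rewards[11] = rewards[11], rewards[7]
    return rewards


def step(state, action):
    """Deterministic transition to the left or right child."""
    return 2 * state + 1 + action


def oracle_value(state, action, rewards):
    """Value of the oracle that always takes `action`, started in `state`."""
    while state not in rewards:
        state = step(state, action)
    return rewards[state]


def rollout(rewards):
    """Follow, in each state, the oracle with the largest value in that state."""
    state, path = 0, [0]
    while state not in rewards:
        best = max((LEFT, RIGHT), key=lambda a: oracle_value(state, a, rewards))
        state = step(state, best)
        path.append(state)
    return path, rewards[state]


if __name__ == "__main__":
    print(rollout(make_rewards()))                  # visits s0, s1, s4, s9
    print(rollout(make_rewards(swap_s7_s11=True)))  # visits s0, s2, s5, s11
```

With these assumed rewards, the first rollout goes left in s0, right in s1 and left in s4, reaching the best leaf. After swapping the rewards of s7 and s11, the left oracle looks worse than the right oracle in s0, so the policy goes right and can no longer reach that leaf.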

