Entropic Desired Dynamics for Intrinsic Control: Supplemental Material
Neural Information Processing Systems
While the focus of this work has been on unsupervised evaluation, here we provide a proof-of-concept that EDDICT can aid in the achievement of task rewards via structured exploration. To do so, we train EDDICT alongside a standard policy that maximizes task rewards, which we refer to as the task policy. While the EDDICT training procedure remains unchanged, experience for the task policy is generated by a behavior policy that randomly switches between EDDICT's policy and the task policy at regular intervals (every 20 steps). The motivation is similar to recent work [8] on temporally extended ε-greedy exploration: temporally coherent exploration can cover the state space more rapidly. Two separate networks were used to instantiate EDDICT and the task policy, so any potential benefits must arise from improved exploration rather than, e.g., parameter sharing.
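The switching behavior policy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gym-style `env` interface, the policy call signatures, and the helper name `behavior_policy_rollout` are all assumptions; only the 20-step switching interval and the random choice between the two policies come from the text.

```python
import random

SWITCH_INTERVAL = 20  # steps between policy switches (value from the text)


def behavior_policy_rollout(env, eddict_policy, task_policy, num_steps):
    """Collect experience by randomly switching between EDDICT's policy
    and the task policy every SWITCH_INTERVAL steps (hedged sketch).

    `env` is assumed to follow a minimal gym-style API (reset/step);
    each policy is assumed to map an observation to an action.
    """
    obs = env.reset()
    trajectory = []
    # Randomly pick which policy is active at the start of the episode.
    active = random.choice([eddict_policy, task_policy])
    for t in range(num_steps):
        if t > 0 and t % SWITCH_INTERVAL == 0:
            # Re-draw the active policy at regular intervals; the draw may
            # keep the same policy or switch to the other one.
            active = random.choice([eddict_policy, task_policy])
        action = active(obs)
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            obs = env.reset()
    return trajectory
```

Because each policy stays active for a full interval, the resulting exploration is temporally coherent rather than per-step random, which is the property the text attributes to temporally extended ε-greedy.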