Entropic Desired Dynamics for Intrinsic Control: Supplemental Material
Neural Information Processing Systems
While the focus of this work has been on unsupervised evaluation, here we provide a proof-of-concept that EDDICT can aid in the achievement of task rewards via structured exploration. To do so, we train EDDICT alongside a standard policy that maximizes task rewards, which we refer to as the task policy. While the EDDICT training procedure remains unchanged, experience for the task policy is generated by a behavior policy that randomly switches between EDDICT's policy and the task policy at regular intervals (every 20 steps). The motivation is similar to recent work [8] on temporally extended ε-greedy exploration: temporally coherent exploration can cover the state space more rapidly. Two separate networks were used to instantiate EDDICT and the task policy, so any potential benefits must arise from improved exploration rather than, e.g., parameter sharing.
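The switching behavior policy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gym-style `env` interface, the policy call signatures, and the helper name `behavior_policy_rollout` are all assumptions; only the 20-step switching interval and the random choice between the two policies come from the text.

```python
import random

SWITCH_INTERVAL = 20  # steps between policy switches (value from the text)


def behavior_policy_rollout(env, eddict_policy, task_policy, num_steps):
    """Collect experience by randomly switching between EDDICT's policy
    and the task policy every SWITCH_INTERVAL steps (hedged sketch).

    `env` is assumed to follow a minimal gym-style API (reset/step);
    each policy is assumed to map an observation to an action.
    """
    obs = env.reset()
    trajectory = []
    # Randomly pick which policy is active at the start of the episode.
    active = random.choice([eddict_policy, task_policy])
    for t in range(num_steps):
        if t > 0 and t % SWITCH_INTERVAL == 0:
            # Re-draw the active policy at regular intervals; the draw may
            # keep the same policy or switch to the other one.
            active = random.choice([eddict_policy, task_policy])
        action = active(obs)
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            obs = env.reset()
    return trajectory
```

Because each policy stays active for a full interval, the resulting exploration is temporally coherent rather than per-step random, which is the property the text attributes to temporally extended ε-greedy.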