Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Local prediction-error-based curiosity rewards focus on the current transition without considering the world model's cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
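As a concrete reading of the abstract, the sketch below co-trains a world model with a scalar critic that regresses the current prediction error; the intrinsic reward is the error in excess of that learned baseline. All names, network sizes, and the squared-error loss are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the Curiosity-Critic intrinsic reward (assumptions:
# `world_model`, `baseline_net`, MLP sizes, and MSE losses are placeholders).
import torch
import torch.nn as nn

state_dim, action_dim = 8, 4

world_model = nn.Sequential(   # predicts next state from (s, a)
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, state_dim))
baseline_net = nn.Sequential(  # critic: scalar asymptotic-error baseline
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))

wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
b_opt = torch.optim.Adam(baseline_net.parameters(), lr=1e-3)

def step(s, a, s_next):
    """One co-training step; returns the per-transition intrinsic reward."""
    sa = torch.cat([s, a], dim=-1)

    # Current prediction error of the world model on this transition.
    pred_err = ((world_model(sa) - s_next) ** 2).mean(dim=-1)

    # The critic regresses the scalar error; as the world model saturates,
    # its output approaches the irreducible (aleatoric) error of (s, a).
    baseline = baseline_net(sa).squeeze(-1)
    b_loss = ((baseline - pred_err.detach()) ** 2).mean()
    b_opt.zero_grad(); b_loss.backward(); b_opt.step()

    # Intrinsic reward: error in excess of the estimated asymptotic baseline.
    # High for learnable transitions, near zero for purely stochastic ones.
    r_int = (pred_err - baseline).detach()

    wm_loss = pred_err.mean()
    wm_opt.zero_grad(); wm_loss.backward(); wm_opt.step()
    return r_int

# Usage: feed transitions from the exploration policy's replay buffer.
s = torch.randn(32, state_dim)
a = torch.randn(32, action_dim)
s_next = torch.randn(32, state_dim)
r = step(s, a, s_next)  # shape (32,): intrinsic reward per transition
```

In this reading, setting the baseline to zero recovers plain prediction-error curiosity, which is how prior formulations emerge as special cases of the baseline approximation.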
NeurIPS 2022: On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

If you ran experiments:
(a) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? Please refer to both the main text and the appendix for experiment details.
(b) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? All adaptation experiments in Procgen and RLBench are run for 3 seeds.
(c) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? As stated in Section 2, we use RTX A5000 GPUs, each with 24 GB of memory. The C2F-ARM algorithm and training framework are built on the original authors' implementation.

Did you mention the license of the assets?
Supplementary Material for Rethinking Value Function Learning for Generalization in Reinforcement Learning
Then, we calculate the mean stiffness of the value network across all state pairs and report its average computed over all training epochs. Each agent is trained on 200 training levels for 25M environment steps, and the mean and standard deviation are computed over 10 different runs. More specifically, we collect 100 training episodes throughout training and evaluate the value network prediction for the initial state of each trajectory.
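The excerpt does not define stiffness; the sketch below assumes the common definition as the cosine similarity between per-state gradients of the value loss (Fort et al., 2019). The tiny value network, random states, and regression targets are placeholders.

```python
# Mean stiffness of a value network over all state pairs (assumed definition:
# cosine similarity of per-state value-loss gradients).
import itertools
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))

def grad_vector(state, target):
    """Flattened gradient of the value loss at a single state."""
    value_net.zero_grad()
    loss = ((value_net(state) - target) ** 2).sum()
    loss.backward()
    return torch.cat([p.grad.flatten() for p in value_net.parameters()])

states = torch.randn(16, 8)   # batch of sampled states
targets = torch.randn(16, 1)  # e.g., value regression targets

grads = [grad_vector(s, t) for s, t in zip(states, targets)]
cos = nn.functional.cosine_similarity

# Average stiffness over all distinct state pairs.
pairs = list(itertools.combinations(range(len(grads)), 2))
mean_stiffness = sum(cos(grads[i], grads[j], dim=0) for i, j in pairs) / len(pairs)
print(float(mean_stiffness))
```

In the protocol described above, this quantity would be computed each epoch and averaged over training, with mean and standard deviation taken over the 10 runs.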
A Potential Negative Societal Impacts
We have not trained our models with sensitive or private data, and we emphasize that our model's direct [...]

[...] L(n) other than the constant one, as long as g(n) and l(n) are positively correlated.

The results for the baselines AdaSubS, kSubS, BC, CQL, DT, and HIPS with learned models were copied from [18]. The total number of GPU hours used for this work was approximately 7,500. We used 6 CPU workers (AMD Trento) per GPU. In the latter case, completeness cannot be guaranteed.
[Figure 8: An example of Agent 1 using the "clean" action while facing East. (a) The initial setup, with two agents and two river tiles; Agent 1's action is resolved first. The "main" beam extends directly in front of the agent, while two auxiliary beams [...]. A beam stops when it hits a dirty river tile.]

The Sequential Social Dilemma Games, introduced in Leibo et al. [2017], are a kind of MARL environment; all of these have open-source implementations in [Vinitsky et al., 2019]. The cleaning beam is shown in Figure 8a.
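The beam-stopping rule in the caption is simple to state in code. The grid encoding and helper name below are assumptions for illustration, not the implementation of Vinitsky et al. [2019].

```python
# Toy illustration of the cleaning-beam mechanic from Figure 8.
DIRTY, CLEAN = "D", "C"

def fire_clean_beam(grid, row, col, d_row, d_col):
    """Advance a cleaning beam cell by cell from the agent's position;
    it cleans the first dirty river tile it hits and stops there."""
    r, c = row + d_row, col + d_col
    while 0 <= r < len(grid) and 0 <= c < len(grid[0]):
        if grid[r][c] == DIRTY:
            grid[r][c] = CLEAN  # beam stops on a dirty river tile
            return (r, c)
        r, c = r + d_row, c + d_col
    return None                 # beam left the grid without hitting dirt

grid = [list("...."),
        list(".D.."),           # dirty river tile one step east of the agent
        list("....")]
print(fire_clean_beam(grid, 1, 0, 0, 1))  # agent at (1, 0) facing East -> (1, 1)
```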