Goto

Collaborating Authors

 reinforcementlearning



Overleaf Example

Neural Information Processing Systems

We model episode sessions--parts of the episode where the latent state isfixed--and propose three keymodifications toexisting meta-RL methods: (i) consistency of latent information within sessions, (ii) session masking, and (iii) priorlatent conditioning.


SupplementaryMaterialfor" HierarchicalAdaptive ValueEstimationforMulti-modalVisual ReinforcementLearning "

Neural Information Processing Systems

Section C describes the details of the experimental setup, including network architectures, hyperparameters,andhardwaredetails. Thisoutcomeemphasizes the necessity of feature interaction or feature fusion to tackle intricate situations. Furthermore, an amalgamation of feature fusion and value fusion can offer better performance. This adjustment allows us to evaluate the robustness and adaptability of our approach in handling a larger number of vehicles in the environment. As we increase the number of vehicles on the road, Fig. A2 (a) clearly indicates that HAVE consistently delivers the highest performance. The training and testing curves of HAVE and other comparable methods are given in A4.



Distributional Reward Decomposition for Reinforcement Learning

Neural Information Processing Systems

Van Seijen et al. [2017] propose to split a state into different sub-states, each with a sub-reward obtained bytraining ageneral valuefunction, andlearnmultiple valuefunctions withsub-rewards. The architecture is rather limited due to requiring prior knowledge of how to split into sub-states.



Temporal Regularization for Markov Decision Process

Neural Information Processing Systems

Yetinreinforcementlearning,duetothenatureofthe Bellman equation, there isanopportunity toalsoexploit temporal regularization based on smoothness in value estimates over trajectories. This paper explores a class of methods for temporal regularization.


SupplementaryMaterialforRethinkingValue FunctionLearningforGeneralizationin ReinforcementLearning

Neural Information Processing Systems

Then,wecalculatethe mean stiffness of the value network across all state pairs and report its average computed over all trainingepochs. Eachagentis trained on 200 training levels for 25M environment steps. The mean and standard deviation are computedover10differentruns. Morespecifically,wecollect100 training episodes throughout the training and evaluate the value network prediction for the initial stateofeachtrajectory. Each agent is trained on 200 training levels for 25M environment steps.


ContrastiveIntrinsicControlforUnsupervised ReinforcementLearning

Neural Information Processing Systems

Unlikeknowledge-based anddata-basedalgorithms, competence-based algorithms simultaneously address both the exploration challenge as well as distilling the generated experience in the form of reusable skills.