Goto

Collaborating Authors

 Reinforcement Learning





d010396ca8abf6ead8cacc2c2f2f26c7-Paper.pdf

Neural Information Processing Systems

A multi-armed bandit (MAB) problem is one of the classic models of sequential decision making (Auer et al., 2002; Agrawal and Goyal, 2012, 2013a).



A Hierarchical Reinforcement Learning Based Optimization Framework for Large-scale Dynamic Pickup and Delivery Problems Yi Ma

Neural Information Processing Systems

To address this problem, existing methods partition the overall DPDP into fixed-size sub-problems by caching online generated orders and solve each sub-problem, or on this basis to utilize the predicted future orders to optimize each sub-problem further. However, the solution quality and efficiency of these methods are unsatisfactory, especially when the problem scale is very large.




RMIX: LearningRisk-SensitivePoliciesfor CooperativeReinforcementLearningAgents

Neural Information Processing Systems

Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes the failure of these methods to train coordinating agents incomplexenvironments. Toaddress these issues, we propose RMIX, anovelcooperativeMARL method with theConditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaRfordecentralized execution. Then,tohandle thetemporal nature of the stochastic outcomes during executions, we propose a dynamic risk level predictorforriskleveltuning.