 Learning Graphical Models


Regularized Softmax Deep Multi-Agent Q-Learning

Neural Information Processing Systems

Tackling overestimation in Q-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular Q-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from more severe overestimation in practice than previously acknowledged, and that this overestimation is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline, and we demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent Q-Learning, is general and can be applied to any Q-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
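As a rough illustration of why a softmax operator can temper overestimation, the sketch below compares a Boltzmann-softmax-weighted value with the hard max over a small list of action-values. It is a single-agent, tabular sketch only: the function name `softmax_value`, the inverse temperature `beta`, and the example values are assumptions made for illustration, and it does not reproduce the multi-agent approximation or the regularizer that the abstract refers to.

```python
import math


def softmax_value(q_values, beta):
    """Boltzmann-softmax-weighted average of q_values (hypothetical helper).

    beta is an assumed inverse temperature: beta = 0 gives the plain mean,
    and large beta approaches the hard max.
    """
    m = max(q_values)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / total


q = [1.0, 2.0, 3.0]          # illustrative action-values, not from the paper
soft = softmax_value(q, beta=1.0)
hard = max(q)
print(soft <= hard)          # the softmax value never exceeds the max
```

Because the softmax value is a weighted average of the action-values, it is bounded above by the max, which is the property that makes it a candidate replacement for the max in a bootstrapped target.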



Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling

Neural Information Processing Systems

This paper provides statistical sample complexity bounds for score-matching and its applications in causal discovery. We demonstrate that accurate estimation of the score function is achievable by training a standard deep ReLU neural network using stochastic gradient descent. We establish bounds on the error rate of recovering causal relationships using the score-matching-based causal discovery method of Rolland et al. [2022], assuming a sufficiently good estimation of the score function. Finally, we analyze the upper bound of score-matching estimation within score-based generative modeling, which has been applied to causal discovery but is also of independent interest within the domain of generative models.


Single Layer Predictive Normalized Maximum Likelihood for Out-of-Distribution Detection: Supplementary Material

Neural Information Processing Systems

We use the same notation as in Section 4.2. Denote e_c as a one-hot row vector of the true label; we define the hypothesis set that the genie is allowed to choose from as PΘ = pθ(y|x) = 1 2πσ2 exp 1 2σ2 y f(x⊤nθ) e⊤c. We simulate the response of the pNML regret for two classes (C = 2) and divide it by log C so that the regret is bounded between 0 and 1. Figure 1 shows the regret behaviour for different p1 (the ERM probability assignment of class 1) as a function of x⊤g. For an ERM model that is certain of its prediction (p1 = 0.99, represented by the purple curve), a slight variation of x⊤g causes a large response of the regret compared to p1 equal to 0.55 and 0.85. Next, we compute the correlation matrix of the training embeddings and perform an SVD. For the SVHN training set, most of the energy is located in the first 50 eigenvalues, after which there is a significant decrease of approximately 10^3. The same phenomenon is also seen in Figure 2a, which shows the eigenvalues of the ResNet-40 model.


Single Layer Predictive Normalized Maximum Likelihood for Out-of-Distribution Detection

Neural Information Processing Systems

Detecting out-of-distribution (OOD) samples is vital for developing machine-learning-based models for critical safety systems. Common approaches for OOD detection assume access to some OOD samples during training, which may not be available in a real-life scenario. Instead, we utilize the predictive normalized maximum likelihood (pNML) learner, in which no assumptions are made on the tested input. We derive an explicit expression of the pNML and its generalization error, denoted as the regret, for a single layer neural network (NN). We show that this learner generalizes well when (i) the test vector resides in a subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, or (ii) the test sample is far from the decision boundary. Furthermore, we describe how to efficiently apply the derived pNML regret to any pretrained deep NN, by employing the explicit pNML for the last layer, followed by the softmax function. Applying the derived regret to a deep NN requires neither additional tunable parameters nor extra data. We extensively evaluate our approach on 74 OOD detection benchmarks using DenseNet-100, ResNet-34, and WideResNet-40 models trained with CIFAR-100, CIFAR-10, SVHN, and ImageNet-30, showing a significant improvement of up to 15.6% over recent leading methods.


Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards

Neural Information Processing Systems

Incrementality, which measures the causal effect of showing an ad to a potential customer (e.g. a user on an internet platform) versus not showing it, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner without knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most Õ(H²√T), which depends on the number of rounds H and the number of episodes T, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is mixed and delayed. To handle this difficulty, we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.


Risk-Averse Bayes-Adaptive Reinforcement Learning

Neural Information Processing Systems

In this work, we address risk-averse Bayes-adaptive reinforcement learning. We pose the problem of optimising the conditional value at risk (CVaR) of the total return in Bayes-adaptive Markov decision processes (MDPs). We show that a policy optimising CVaR in this setting is risk-averse to both the epistemic uncertainty due to the prior distribution over MDPs, and the aleatoric uncertainty due to the inherent stochasticity of MDPs. We reformulate the problem as a two-player stochastic game and propose an approximate algorithm based on Monte Carlo tree search and Bayesian optimisation. Our experiments demonstrate that our approach significantly outperforms baseline approaches for this problem.
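For readers unfamiliar with the objective, the sketch below computes an empirical conditional value at risk (CVaR) of a set of sampled total returns, taken here as the mean of the worst `alpha` fraction of outcomes. The function name `empirical_cvar`, the tail-size rounding rule, and the sample returns are assumptions made for illustration; the sketch shows only the risk measure itself, not the Bayes-adaptive formulation, the stochastic game, or the search algorithm described in the abstract.

```python
import math


def empirical_cvar(returns, alpha):
    """Mean of the worst alpha-fraction of sampled returns (hypothetical helper).

    Lower returns are treated as worse. The tail size is rounded up so that
    at least one sample is always included; this rounding is an assumption.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(returns)
    k = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:k]
    return sum(tail) / len(tail)


# Illustrative sampled returns, not taken from the paper.
sample_returns = [10, -5, 3, 8, -20, 6, 7, 9, 4, 5]
cvar_20 = empirical_cvar(sample_returns, alpha=0.2)   # mean of the two worst returns
mean_return = empirical_cvar(sample_returns, alpha=1.0)  # reduces to the plain mean
print(cvar_20, mean_return)
```

With `alpha = 1.0` the measure reduces to the expected return, so smaller values of `alpha` correspond to a more risk-averse objective that focuses on the lower tail.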


Self-Consistent Models and Values

Neural Information Processing Systems

Learned models of the environment provide reinforcement learning (RL) agents with flexible ways of making predictions about the environment. In particular, models enable planning, i.e. using more computation to improve value functions or policies, without requiring additional environment interactions. In this work, we investigate a way of augmenting model-based RL, by additionally encouraging a learned model and value function to be jointly self-consistent. Our approach differs from classic planning methods such as Dyna, which only update values to be consistent with the model. We propose multiple self-consistency updates, evaluate these in both tabular and function approximation settings, and find that, with appropriate choices, self-consistency helps both policy evaluation and control.