Goto

Collaborating Authors

 Uncertainty




Training and Inference on Any-Order Autoregressive Models the Right Way

Neural Information Processing Systems

Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs) - closely related to popular models such as BERT and XLNet - has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But, in spite of their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR10, ImageNet32), and continuous tabular data domains.


Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation

Neural Information Processing Systems

We study the model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs). In this setting, the agent works in two phases. In the exploration phase, the agent interacts with the environment and collects samples without the reward. In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy. We propose a new provably efficient algorithm, called UCRL-RFE under the Linear Mixture MDP assumption, where the transition probability kernel of the MDP can be parameterized by a linear function over certain feature mappings defined on the triplet of state, action, and next state.


Maximum Likelihood Training of Score-Based Diffusion Models

Neural Information Processing Systems

Score-based diffusion models synthesize samples by reversing a stochastic process that diffuses data to noise, and are trained by minimizing a weighted combination of score matching losses. The log-likelihood of score-based diffusion models can be tractably computed through a connection to continuous normalizing flows, but log-likelihood is not directly optimized by the weighted combination of score matching losses. We show that for a specific weighting scheme, the objective upper bounds the negative log-likelihood, thus enabling approximate maximum likelihood training of score-based diffusion models. We empirically observe that maximum likelihood training consistently improves the likelihood of score-based diffusion models across multiple datasets, stochastic processes, and model architectures. Our best models achieve negative log-likelihoods of 2.83 and 3.76 bits/dim on CIFAR-10 and ImageNet 32 ˆ32 without any data augmentation, on a par with state-of-the-art autoregressive models on these tasks.



Riemannian Score-Based Generative Modelling

Neural Information Processing Systems

Score-based generative models (SGMs) are a powerful class of generative models that exhibit remarkable empirical performance. Score-based generative modelling (SGM) consists of a "noising" stage, whereby a diffusion is used to gradually add Gaussian noise to data, and a generative model, which entails a "denoising" process defined by approximating the time-reversal of the diffusion. Existing SGMs assume that data is supported on a Euclidean space, i.e. a manifold with flat geometry. In many domains such as robotics, geoscience or protein modelling, data is often naturally described by distributions living on Riemannian manifolds and current SGM techniques are not appropriate. We introduce here Riemannian Score-based Generative Models (RSGMs), a class of generative models extending SGMs to Riemannian manifolds. We demonstrate our approach on a variety of manifolds, and in particular with earth and climate science spherical data.



Single Layer Predictive Normalized Maximum Likelihood for Out-of-Distribution Detection-Supplementary material-Anonymous Author(s) Affiliation Address email

Neural Information Processing Systems

We use the same notations as in section 4.2 Denote ec as a one-hot row vector of the true label, we define the hypothesis set that genie is allowed3 to choose from as4 PΘ = pθ(y|x) = 1 2πσ2 exp 1 2σ2 y f(x>nθ) e>c We simulate the response of the pNML regret for two classes (C=2) and divide it by logC to have11 the regret bounded between 0 and 1. Figure 1 shows the regret behaviour for different p1 (the ERM12 probability assignment of class 1) as a function of x>g.13 For an ERM model that is certain on the prediction (p1 = 0.99 that is represented by the purple14 curve), a slight variation of x>g causes a large response of the regret comparing to p1 that equals15 0.55 and 0.85. Next, 20 we compute the correlation matrix of the training embeddings and perform an SVD decomposition. For the SVHN training set, most of the energy is located in the first 50 eigenvalues and then 24 there is a significant decrease of approximately 103. The same phenomenon is also seen in figure 2a 25 that shows the eigenvalues of ResNet-40 model.


Single Layer Predictive Normalized Maximum Likelihood for Out-of-Distribution Detection

Neural Information Processing Systems

Detecting out-of-distribution (OOD) samples is vital for developing machine learning based models for critical safety systems. Common approaches for OOD detection assume access to some OOD samples during training which may not be available in a real-life scenario. Instead, we utilize the predictive normalized maximum likelihood (pNML) learner, in which no assumptions are made on the tested input. We derive an explicit expression of the pNML and its generalization error, denoted as the regret, for a single layer neural network (NN). We show that this learner generalizes well when (i) the test vector resides in a subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, or (ii) the test sample is far from the decision boundary. Furthermore, we describe how to efficiently apply the derived pNML regret to any pretrained deep NN, by employing the explicit pNML for the last layer, followed by the softmax function. Applying the derived regret to deep NN requires neither additional tunable parameters nor extra data. We extensively evaluate our approach on 74 OOD detection benchmarks using DenseNet-100, ResNet-34, and WideResNet40 models trained with CIFAR-100, CIFAR-10, SVHN, and ImageNet-30 showing a significant improvement of up to 15.6% over recent leading methods.