 Directed Networks


XGBoostLSS -- An extension of XGBoost to probabilistic forecasting

arXiv.org Artificial Intelligence

We propose a new framework of XGBoost that predicts the entire conditional distribution of a univariate response variable. In particular, XGBoostLSS models all moments of a parametric distribution, i.e., mean, location, scale and shape (LSS), instead of the conditional mean only. Choosing from a wide range of continuous, discrete and mixed discrete-continuous distributions, modelling and predicting the entire conditional distribution greatly enhances the flexibility of XGBoost, as it allows one to gain additional insight into the data generating process, as well as to create probabilistic forecasts from which prediction intervals and quantiles of interest can be derived. We present both a simulation study and real-world examples that demonstrate the benefits of our approach.
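As a minimal sketch of how predicted distribution parameters translate into prediction intervals, assume a Gaussian response whose location and scale have already been predicted for a single observation. The function name `prediction_interval` and the numeric values are illustrative assumptions, not part of the XGBoostLSS API; only the Python standard library is used.

```python
from statistics import NormalDist


def prediction_interval(loc, scale, coverage=0.9):
    """Central prediction interval of a Gaussian with the given location and scale."""
    alpha = 1.0 - coverage
    dist = NormalDist(mu=loc, sigma=scale)
    # Lower and upper quantiles that leave alpha/2 probability in each tail.
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)


# Hypothetical predictions: same location, different scale.
low_a, high_a = prediction_interval(loc=10.0, scale=1.0)
low_b, high_b = prediction_interval(loc=10.0, scale=3.0)
```

Two observations with the same predicted mean but different predicted scales receive intervals of different widths, which a model of the conditional mean alone cannot express.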


The Design of Mutual Information

arXiv.org Machine Learning

We derive the functional form of mutual information (MI) from a set of design criteria and a principle of maximal sufficiency. The MI between two sets of propositions is a global quantifier of correlations and is implemented as a tool for ranking joint probability distributions with respect to said correlations. The derivation parallels the derivations of relative entropy with an emphasis on the behavior of independent variables. By constraining the functional $I$ according to special cases, we arrive at its general functional form and hence establish a clear meaning behind its definition. We also discuss the notion of sufficiency and offer a new definition which broadens its applicability.
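For orientation, the following sketch computes the standard Shannon mutual information of a discrete joint distribution. It illustrates the quantity being discussed, not the derivation in the paper; the function name and the two example tables are assumptions chosen for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a 2D list."""
    px = [sum(row) for row in joint]          # marginal over rows
    py = [sum(col) for col in zip(*joint)]    # marginal over columns
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # terms with zero probability contribute nothing
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]   # joint equals product of marginals
correlated = [[0.5, 0.0], [0.0, 0.5]]        # the two variables always agree

mi_independent = mutual_information(independent)
mi_correlated = mutual_information(correlated)
```

The independent table has zero MI, while the perfectly correlated one attains log 2 nats, so MI ranks the second joint distribution as more correlated.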


Time series cluster kernels to exploit informative missingness and incomplete label information

arXiv.org Machine Learning

The time series cluster kernel (TCK) provides a powerful tool for analysing multivariate time series subject to missing data. TCK is designed using an ensemble learning approach in which Bayesian mixture models form the base models. Because of the Bayesian approach, TCK can naturally deal with missing values without resorting to imputation, and the ensemble strategy ensures robustness to hyperparameters, making it particularly well suited for unsupervised learning. However, TCK assumes that values are missing at random and that the underlying missingness mechanism is ignorable, i.e. uninformative, an assumption that does not hold in many real-world applications, such as medicine. To overcome this limitation, we present a kernel capable of exploiting the potentially rich information in the missing values and patterns, as well as the information from the observed data. In our approach, we create a representation of the missing pattern, which is incorporated into mixed mode mixture models in such a way that the information provided by the missing patterns is effectively exploited. Moreover, we also propose a semi-supervised kernel, capable of taking advantage of incomplete label information to learn more accurate similarities. Experiments on benchmark data, as well as a real-world case study of patients described by longitudinal electronic health record data who potentially suffer from hospital-acquired infections, demonstrate the effectiveness of the proposed methods.


Variational Autoencoders and Nonlinear ICA: A Unifying Framework

arXiv.org Machine Learning

The framework of variational autoencoders allows us to efficiently learn deep latent-variable models, such that the model's marginal distribution over observed variables fits the data. Often, we're interested in going a step further, and want to approximate the true joint distribution over observed and latent variables, including the true prior and posterior distributions over latent variables. This is known to be generally impossible due to unidentifiability of the model. We address this issue by showing that for a broad family of deep latent-variable models, identification of the true joint distribution over observed and latent variables is actually possible up to a simple transformation, thus achieving a principled and powerful form of disentanglement. Our result requires a factorized prior distribution over the latent variables that is conditioned on an additionally observed variable, such as a class label or almost any other observation. We build on recent developments in nonlinear ICA, which we extend to the case with noisy, undercomplete or discrete observations, integrated in a maximum likelihood framework. The result also trivially contains identifiable flow-based generative models as a special case.


Exploiting Causality for Selective Belief Filtering in Dynamic Bayesian Networks (Extended Abstract)

arXiv.org Artificial Intelligence

Dynamic Bayesian networks (DBNs) are a general model for stochastic processes with partially observed states. Belief filtering in DBNs is the task of inferring the belief state (i.e. the probability distribution over process states) based on incomplete and uncertain observations. In this article, we explore the idea of accelerating the filtering task by automatically exploiting causality in the process. We consider a specific type of causal relation, called passivity, which pertains to how state variables cause changes in other variables. We present the Passivity-based Selective Belief Filtering (PSBF) method, which maintains a factored belief representation and exploits passivity to perform selective updates over the belief factors. PSBF is evaluated in both synthetic processes and a simulated multi-robot warehouse, where it outperformed alternative filtering methods by exploiting passivity.
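To make the filtering task concrete, here is a minimal sketch of one exact belief update for a small discrete process with a single state variable. It is the textbook predict-then-correct step, not the PSBF method, and it does not use a factored belief representation; all names and numbers are illustrative assumptions.

```python
def filter_step(belief, transition, likelihood):
    """One exact belief update: predict with the transition model, then
    reweight by the likelihood of the current observation.

    belief[i]        : P(state = i | past observations)
    transition[i][j] : P(next state = j | state = i)
    likelihood[j]    : P(current observation | next state = j)
    """
    n = len(belief)
    # Prediction: push the current belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction: weight by the observation likelihood and renormalize.
    unnormalized = [predicted[j] * likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


belief = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood = [0.7, 0.1]  # the observation is far more likely in state 0
new_belief = filter_step(belief, transition, likelihood)
```

The cost of this exact update grows with the size of the joint state space, which is the motivation for factored representations and selective updates of the kind the abstract describes.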


Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

arXiv.org Artificial Intelligence

We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$ and then transitions to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and those of our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and only use elementary functions.
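As a point of reference for the KL-divergence result, the sketch below computes the KL-divergence between two Bernoulli distributions and compares it with the classical chi-square upper bound. The chi-square bound is a well-known textbook inequality and is an assumption chosen for illustration; it is not the sharper bound derived in the paper, which the abstract does not state.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 <= p <= 1 and 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def chi_square_bound(p, q):
    """Classical upper bound: KL <= (p - q)^2 / (q * (1 - q))."""
    return (p - q) ** 2 / (q * (1 - q))


kl = bernoulli_kl(0.3, 0.5)
bound = chi_square_bound(0.3, 0.5)
```

Both functions use only elementary operations, which matches the kind of simple closed-form bound the abstract refers to.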


Goal Recognition Design in Deterministic Environments

Journal of Artificial Intelligence Research

Goal recognition design (GRD) facilitates understanding the goals of acting agents through the analysis and redesign of goal recognition models, thus offering a solution for assessing and minimizing the maximal progress of any agent in the model before goal recognition is guaranteed. In a nutshell, given a model of a domain and a set of possible goals, a solution to a GRD problem determines (1) the extent to which actions performed by an agent within the model reveal the agent's objective; and (2) how best to modify the model so that the objective of an agent can be detected as early as possible. This approach is relevant to any domain in which rapid goal recognition is essential and the model design can be controlled. Applications include intrusion detection, assisted cognition, computer games, and human-robot collaboration. A GRD problem has two components: the analyzed goal recognition setting, and a design model specifying the possible ways the environment in which agents act can be modified so as to facilitate recognition. This work formulates a general framework for GRD in deterministic and partially observable environments, and offers a toolbox of solutions for evaluating and optimizing model quality for various settings. For the purpose of evaluation we suggest the worst case distinctiveness (WCD) measure, which represents the maximal cost of a path an agent may follow before its goal can be inferred by a goal recognition system. We offer novel compilations to classical planning for calculating WCD in settings where agents are bounded-suboptimal. We then suggest methods for minimizing WCD by searching for an optimal redesign strategy within the space of possible modifications, and using pruning to increase efficiency. We support our approach with an empirical evaluation that measures WCD in a variety of GRD settings and tests the efficiency of our compilation-based methods for computing it. We also examine the effectiveness of reducing WCD via redesign and the performance gain brought about by our proposed pruning strategy.


Convergence Rates for Gaussian Mixtures of Experts

arXiv.org Machine Learning

We provide a theoretical treatment of over-specified Gaussian mixtures of experts with covariate-free gating networks. We establish the convergence rates of the maximum likelihood estimation (MLE) for these models. Our proof technique is based on a novel notion of algebraic independence of the expert functions. Drawing on optimal transport theory, we establish a connection between the algebraic independence and a certain class of partial differential equations (PDEs). Exploiting this connection allows us to derive convergence rates and minimax lower bounds for parameter estimation.


Bayesian deep learning with hierarchical prior: Predictions from limited and noisy data

arXiv.org Machine Learning

Datasets in engineering applications are often limited and contaminated, mainly due to unavoidable measurement noise and signal distortion. Thus, using conventional data-driven approaches to build a reliable discriminative model, and further applying this identified surrogate to uncertainty analysis, remains very challenging. A deep learning (DL) approach is presented to provide predictions based on limited and noisy data. To address noise perturbation, the Bayesian learning method, which naturally facilitates an automatic updating mechanism, is considered to quantify and propagate model uncertainties into predictive quantities. Specifically, hierarchical Bayesian modeling (HBM) is first adopted to describe model uncertainties, which allows the prior assumption to be less subjective while also making the proposed surrogate more robust. Next, the Bayesian inference is seamlessly integrated into the DL framework, which in turn supports probabilistic programming by yielding a probability distribution of the quantities of interest rather than their point estimates. Variational inference (VI) is implemented for the posterior distribution analysis, where the intractable marginalization of the likelihood function over parameter space is framed in an optimization format, and a stochastic gradient descent method is applied to solve this optimization problem. Finally, Monte Carlo simulation is used to obtain an unbiased estimator in the predictive phase of Bayesian inference, where the proposed Bayesian deep learning (BDL) scheme is able to offer confidence bounds for the output estimation by analyzing propagated uncertainties. The effectiveness of Bayesian shrinkage is demonstrated in improving predictive performance using contaminated data, and various examples are provided to illustrate concepts, methodologies, and algorithms of this proposed BDL modeling technique.


Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting

arXiv.org Machine Learning

Semi-supervised learning (SSL) uses unlabeled data for training and has been shown to greatly improve performance when compared to a supervised approach on the labeled data available. This claim depends both on the amount of labeled data available and on the algorithm used. In this paper, we compute analytically the gap between the best fully-supervised approach on labeled data and the best semi-supervised approach using both labeled and unlabeled data. We quantify the best possible increase in performance obtained thanks to the unlabeled data, i.e. we compute the accuracy increase due to the information contained in the unlabeled data. Our work deals with a simple high-dimensional Gaussian mixture model for the data in a Bayesian setting. Our rigorous analysis builds on recent theoretical breakthroughs in high-dimensional inference and a large body of mathematical tools from statistical physics initially developed for spin glasses.