 Bayesian Inference


The Reciprocal Bayesian LASSO

arXiv.org Machine Learning

Throughout the paper, we assume that $y$ and $X$ have been centered at 0 so there is no intercept in the linear model $y = X\beta + \epsilon$, where $y$ is the $n \times 1$ vector of centered responses, $X$ is the $n \times p$ matrix of standardized regressors, $\beta$ is the $p \times 1$ vector of coefficients to be estimated, and $\epsilon$ is the $n \times 1$ vector of independent and identically distributed normal errors with mean 0 and variance $\sigma^2$. Compared to traditional penalization functions that are usually symmetric about 0, continuous and nondecreasing in $(0, \infty)$, the rLASSO penalty functions are decreasing in $(0, \infty)$, discontinuous at 0, and diverge to infinity as the coefficients approach zero. From a theoretical standpoint, rLASSO shares the same oracle property and the same rate of estimation error as other LASSO-type penalty functions. An early reference to this class of models can be found in Song and Liang (2015), with more recent papers focusing on large sample asymptotics, along with computational strategies for frequentist estimation (Shin et al., 2018; Song, 2018). Our approach differs from this line of work in adopting a Bayesian perspective on rLASSO estimation. Ideally, a Bayesian solution can be obtained by placing appropriate priors on the regression coefficients that will mimic the effects of the rLASSO penalty. As apparent from (1), this arises in assuming a prior for $\beta$ that decomposes as a product of independent inverse Laplace (double exponential) densities: $\pi(\beta) = \prod_{j=1}^{p} \frac{\lambda}{2\beta_j^2} \exp\{-\lambda / |\beta_j|\} \, I\{\beta_j \neq 0\}$.
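The link between the penalty and the prior can be illustrated with a minimal Python sketch. It assumes the rLASSO penalty takes the reciprocal form $\lambda \sum_j |\beta_j|^{-1} I\{\beta_j \neq 0\}$ (equation (1) is not reproduced in this excerpt, so this form is inferred from the prior), and the function names `rlasso_penalty`, `inverse_laplace_logpdf`, and `log_prior` are hypothetical.

```python
import math


def rlasso_penalty(beta, lam):
    """Assumed reciprocal L1 penalty: lam * sum(1 / |b_j|) over nonzero b_j."""
    return lam * sum(1.0 / abs(b) for b in beta if b != 0.0)


def inverse_laplace_logpdf(b, lam):
    """Log of lam / (2 b^2) * exp(-lam / |b|); the density is 0 at b == 0."""
    if b == 0.0:
        return -math.inf
    return math.log(lam) - math.log(2.0) - 2.0 * math.log(abs(b)) - lam / abs(b)


def log_prior(beta, lam):
    """Log of the product prior over independent coefficients."""
    return sum(inverse_laplace_logpdf(b, lam) for b in beta)


if __name__ == "__main__":
    beta = [2.0, -4.0, 0.5]
    print(rlasso_penalty(beta, lam=1.0))
    print(log_prior(beta, lam=1.0))
```

Up to the $-2\log|\beta_j|$ and normalizing terms, the negative log prior reproduces the assumed penalty, which is why small nonzero coefficients are strongly discouraged under this prior.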


Optimal estimation of sparse topic models

arXiv.org Machine Learning

Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consists of observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ when it is possibly element-wise sparse and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance in many scenarios of interest.
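The factorization premise can be made concrete with a toy example. The sketch below assumes the common convention that each column of $A$ and each column of $W$ is a probability vector. The abstract does not state this normalization, and the numbers and the helper `matmul` are purely illustrative.

```python
def matmul(A, W):
    """Plain-Python product of a p x K matrix and a K x n matrix."""
    K, n = len(W), len(W[0])
    return [[sum(row[k] * W[k][j] for k in range(K)) for j in range(n)]
            for row in A]


# Word-topic matrix A (p = 3 words, K = 2 topics); the zeros illustrate
# element-wise sparsity. Each column sums to 1 (assumed convention).
A = [[0.5, 0.0],
     [0.5, 0.2],
     [0.0, 0.8]]

# Topic-document matrix W (K = 2 topics, n = 2 documents); columns sum to 1.
W = [[1.0, 0.25],
     [0.0, 0.75]]

# Expected word frequencies per document: a p x n matrix.
Pi = matmul(A, W)

if __name__ == "__main__":
    for row in Pi:
        print(row)
```

Under these assumptions each column of the product is again a probability vector over the vocabulary, so it can serve as the mean of the observed word frequencies in one document.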


Estimating Latent Demand of Shared Mobility through Censored Gaussian Processes

arXiv.org Machine Learning

Transport demand is highly dependent on supply, especially for shared transport services where availability is often limited. As observed demand cannot be higher than available supply, historical transport data typically represents a biased, or censored, version of the true underlying demand pattern. Without explicitly accounting for this inherent distinction, predictive models of demand would necessarily represent a biased version of true demand, thus less effectively predicting the needs of service users. To counter this problem, we propose a general method for censorship-aware demand modeling, for which we devise a censored likelihood function. We apply this method to the task of shared mobility demand prediction by incorporating the censored likelihood within a Gaussian Process model, which can flexibly approximate arbitrary functional forms. Experiments on artificial and real-world datasets show how taking into account the limiting effect of supply on demand is essential in the process of obtaining an unbiased predictive model of user demand behavior.
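To illustrate what a censorship-aware likelihood can look like, here is a minimal Python sketch of a standard Tobit-style censored Gaussian log-likelihood. This is an assumed textbook form, not necessarily the likelihood devised in the paper. The names `censored_gaussian_loglik`, `f` (latent demand), `censored` (flags for observations capped by supply), and `sigma` are hypothetical.

```python
import math


def norm_logpdf(z):
    """Log density of the standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_sf(z):
    """Survival function 1 - Phi(z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def censored_gaussian_loglik(y, f, censored, sigma):
    """Sum of per-observation log-likelihood terms.

    Uncensored points contribute the usual Gaussian density term.
    Censored points (observed demand capped by supply) contribute the
    probability that true demand is at least the observed value.
    """
    total = 0.0
    for y_i, f_i, c_i in zip(y, f, censored):
        z = (y_i - f_i) / sigma
        if c_i:
            total += math.log(norm_sf(z))
        else:
            total += norm_logpdf(z) - math.log(sigma)
    return total


if __name__ == "__main__":
    # A latent demand of 3 with an observed (capped) value of 1 is
    # plausible if the point is censored, implausible otherwise.
    print(censored_gaussian_loglik([1.0], [3.0], [True], 1.0))
    print(censored_gaussian_loglik([1.0], [3.0], [False], 1.0))
```

In a Gaussian Process setting, `f` would be the latent function values at the inputs. The sketch takes them as given and does not perform inference.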


The Incentives that Shape Behaviour

arXiv.org Artificial Intelligence

Which variables does an agent have an incentive to control with its decision, and which variables does it have an incentive to respond to? We formalize these incentives, and demonstrate unique graphical criteria for detecting them in any single-decision causal influence diagram. To this end, we introduce structural causal influence models, a hybrid of the influence diagram and structural causal model frameworks. Finally, we illustrate how these incentives predict agent incentives in both fairness and AI safety applications.


Fragmentation Coagulation Based Mixed Membership Stochastic Blockmodel

arXiv.org Machine Learning

The Mixed-Membership Stochastic Blockmodel (MMSB) is one of the state-of-the-art Bayesian relational methods suitable for learning the complex hidden structure underlying network data. However, the current formulation of MMSB suffers from two issues: (1) prior information (e.g. entities' community structural information) cannot be well embedded in the modelling; (2) community evolution cannot be well described. Therefore, we propose a non-parametric fragmentation coagulation based Mixed Membership Stochastic Blockmodel (fcMMSB). Our model simultaneously performs entity-based clustering to capture the community information for entities and linkage-based clustering to derive the group information for links. In addition, the proposed model infers the network structure and models community evolution, manifested by appearances and disappearances of communities, using the discrete fragmentation coagulation process (DFCP). By integrating the community structure with the group compatibility matrix we derive a generalized version of MMSB. An efficient Gibbs sampling scheme with the Pólya-Gamma (PG) approach is implemented for posterior inference. We validate our model on synthetic and real-world data.


Bayesian inference of dynamics from partial and noisy observations using data assimilation and machine learning

arXiv.org Machine Learning

The reconstruction from observations of high-dimensional chaotic dynamics such as geophysical flows is hampered by (i) the partial and noisy observations that can realistically be obtained, (ii) the need to learn from long time series of data, and (iii) the unstable nature of the dynamics. To achieve such inference from the observations over long time series, it has been suggested to combine data assimilation and machine learning in several ways. We show how to unify these approaches from a Bayesian perspective using expectation-maximization and coordinate descents. Implementations and approximations of these methods are also discussed. Finally, we numerically and successfully test the approach on two relevant low-order chaotic models with distinct identifiability.


Communication-Efficient Distributed Estimator for Generalized Linear Models with a Diverging Number of Covariates

arXiv.org Machine Learning

Distributed statistical inference has recently attracted immense attention. Herein, we study the asymptotic efficiency of the maximum likelihood estimator (MLE), the one-step MLE, and the aggregated estimating equation estimator for generalized linear models with a diverging number of covariates. Then a novel method is proposed to obtain an asymptotically efficient estimator for large-scale distributed data by two rounds of communication between local machines and the central server. The assumption on the number of machines in this paper is more relaxed and thus practical for real-world applications. Simulations and a case study demonstrate the satisfactory finite-sample performance of the proposed estimators.

Keywords: Generalized linear models, Large-scale distributed data, Asymptotic efficiency, One-step MLE, Diverging p

MSC: 62J12

1. Introduction

In modern times, large-scale data sets have become increasingly common, and they are often stored across multiple machines. Since communication cost between machines is considerably higher than the cost of conducting statistical analysis on a single machine (Jaggi et al., 2014; Smith et al., 2018), it is inefficient to calculate a global estimator by transmitting the local data to a central machine. Further, traditional iterative algorithms applied in a distributed system, such as the Fisher-scoring algorithm for the maximum likelihood estimator (MLE) in generalized linear models (GLMs), cannot avoid multiple rounds of communication that incur exorbitant costs. Therefore, communication-efficient distributed algorithms must be developed to accommodate the new features of modern data sets.


Channels' Confirmation and Predictions' Confirmation: from the Medical Test to the Raven Paradox

arXiv.org Artificial Intelligence

After long arguments between positivism and falsificationism, the verification of universal hypotheses was replaced with the confirmation of uncertain major premises. Unfortunately, Hempel discovered the Raven Paradox (RP). Then, Carnap used the logical probability increment as the confirmation measure. So far, many confirmation measures have been proposed. Among them, measure F, proposed by Kemeny and Oppenheim, possesses the symmetries and asymmetries proposed by Eells and Fitelson, the monotonicity proposed by Greco et al., and the normalizing property suggested by many researchers. Based on the semantic information theory, a measure b* similar to F is derived from the medical test. Like the likelihood ratio, b* and F can only indicate the quality of channels or the testing means instead of the quality of probability predictions. Moreover, it is still not easy to use b*, F, or another measure to clarify the RP. For this reason, measure c*, similar to the correct rate, is derived. The measure c* has the simple form (a-c)/max(a, c); it supports the Nicod Criterion and undermines the Equivalence Condition, and hence can be used to eliminate the RP. Some examples are provided to show why it is difficult to use one of the popular confirmation measures to eliminate the RP. Measures F, b*, and c* indicate that the existence of fewer counterexamples is more essential than the existence of more positive examples, and hence are compatible with Popper's falsification thought.
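The closed form of c* is simple enough to evaluate directly. The Python sketch below assumes that `a` is the number of positive examples and `c` the number of counterexamples. The abstract does not define these symbols, so this reading is an assumption, and the function name `c_star` is hypothetical.

```python
def c_star(a, c):
    """Confirmation measure c* = (a - c) / max(a, c).

    a: assumed count of positive examples
    c: assumed count of counterexamples
    """
    if a < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    if max(a, c) == 0:
        raise ValueError("c* is undefined when both counts are zero")
    return (a - c) / max(a, c)


if __name__ == "__main__":
    # Ten positive examples and no counterexample ...
    print(c_star(10, 0))
    # ... versus a thousand positive examples and ten counterexamples.
    print(c_star(1000, 10))
```

Under this reading, a few positive examples with no counterexample reach the maximum value of 1, while many positive examples with some counterexamples stay below it, which matches the claim that fewer counterexamples matter more than more positive examples.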


A Support Detection and Root Finding Approach for Learning High-dimensional Generalized Linear Models

arXiv.org Machine Learning

Feature selection is important for modeling high-dimensional data, where the number of variables can be much larger than the sample size. In this paper, we develop a support detection and root finding procedure to learn high-dimensional sparse generalized linear models and denote this method by GSDAR. Based on the KKT condition for $\ell_0$-penalized maximum likelihood estimation, GSDAR generates a sequence of estimators iteratively. Under some restricted invertibility conditions on the maximum likelihood function and a sparsity assumption on the target coefficients, the errors of the proposed estimate decay exponentially to the optimal order. Moreover, the oracle estimator can be recovered if the target signal is stronger than the detectable level. We conduct simulations and real data analysis to illustrate the advantages of our proposed method over several existing methods, including Lasso and MCP.


Better Boosting with Bandits for Online Learning

arXiv.org Machine Learning

The examples are considered to be of the form $(x_i, y_i)$, where $x_i$ is the feature vector of the $i$-th example and $y_i \in \{-1, 1\}$ is its class label. Extension to the multiclass case is often handled by breaking down the problem into multiple binary ones, so our analysis and its main results can carry over to the multiclass case. We consider the online setting where examples are presented to the learner in $M$ minibatches of size $b$. On the $n$-th iteration the learner performs the following steps:

1. Receive new examples $x_i$, $\forall x_i \in$ minibatch $n$
2. Predict the label $\hat{y}_i$ and/or the probability estimate $\hat{p}(y_i = 1 \mid x_i)$, $\forall i \in$ minibatch $n$
3. Get true labels $y_i = f(x_i)$, $\forall x_i \in$ minibatch $n$, where $f$ is the labelling function
4. Update learner parameters accordingly

The steps above are intentionally left general enough to describe all learning components encountered in the paper. Our goal is to study the quality of the probability estimates generated by online boosting ensembles and strategies for improving it. Online boosting ensembles consist of multiple base learners, themselves also trained in an online fashion and, as we will see, the techniques used for improving the probability estimates (both the calibrator and the reward models of the bandits) are also learners trained in an online fashion. All follow the same general approach defined above: they maintain a model with a fixed number of parameters (i.e.