Goto

Collaborating Authors

 Bayesian Inference


Active Recursive Bayesian Inference with Posterior Trajectory Analysis Using $\alpha$-Divergence

arXiv.org Machine Learning

Recursive Bayesian inference (RBI) provides optimal Bayesian latent variable estimates in real-time settings with streaming noisy observations. Active RBI attempts to effectively select queries that lead to more informative observations to rapidly reduce uncertainty until a confident decision is made. However, typically the optimality objectives of inference and query mechanisms are not jointly selected. Furthermore, conventional active querying methods stagger due to misleading prior information. Motivated by information theoretic approaches, we propose an active RBI framework with unified inference and query selection steps through Renyi entropy and $\alpha$-divergence. We also propose a new objective based on Renyi entropy and its changes called Momentum that encourages exploration for misleading prior cases. The proposed active RBI framework is applied to the trajectory of the posterior changes in the probability simplex that provides a coordinated active querying and decision making with specified confidence. Under certain assumptions, we analytically demonstrate that the proposed approach outperforms conventional methods such as mutual information by allowing the selections of unlikely events. We present empirical and experimental performance evaluations on two applications: restaurant recommendation and brain-computer interface (BCI) typing systems.


Repulsive Mixture Models of Exponential Family PCA for Clustering

arXiv.org Machine Learning

The mixture extension of exponential family principal component analysis (EPCA) was designed to encode much more structural information about data distribution than the traditional EPCA does. For example, due to the linearity of EPCA's essential form, nonlinear cluster structures cannot be easily handled, but they are explicitly modeled by the mixing extensions. However, the traditional mixture of local EPCAs has the problem of model redundancy, i.e., overlaps among mixing components, which may cause ambiguity for data clustering. To alleviate this problem, in this paper, a repulsiveness-encouraging prior is introduced among mixing components and a diversified EPCA mixture (DEPCAM) model is developed in the Bayesian framework. Specifically, a determinantal point process (DPP) is exploited as a diversity-encouraging prior distribution over the joint local EPCAs. As required, a matrix-valued measure for L-ensemble kernel is designed, within which, $\ell_1$ constraints are imposed to facilitate selecting effective PCs of local EPCAs, and angular based similarity measure are proposed. An efficient variational EM algorithm is derived to perform parameter learning and hidden variable inference. Experimental results on both synthetic and real-world datasets confirm the effectiveness of the proposed method in terms of model parsimony and generalization ability on unseen test data.


Direct loss minimization for sparse Gaussian processes

arXiv.org Machine Learning

The Gaussian process (GP) is an attractive Bayesian model for machine learning which combines an elegant formulation with model flexibility and uncertainty quantification. Sparse Gaussian process (sGP) algorithms provide an approximate solution that mitigates the high computational complexity of GP and the variational approximation is the current best practice for such approximations. Recent theoretical work has shown that an alternative approach, direct loss minimization (DLM), which directly minimizes predictive loss, comes with strong guarantees on the expected loss of the algorithm. In this paper we explore this approach experimentally. We develop the DLM algorithm for sGP and show that with appropriate hyperparameter optimization it provides a significant improvement over the variational approach. In particular, optimizing sGP for log loss provides better calibrated predictions for regression, classification and count prediction, and optimizing sGP for square loss improves the mean square error in regression.


A High-Performance Implementation of Bayesian Matrix Factorization with Limited Communication

arXiv.org Machine Learning

Matrix factorization is a very common machine learning technique in recommender systems. Bayesian Matrix Factorization (BMF) algorithms would be attractive because of their ability to quantify uncertainty in their predictions and avoid over-fitting, combined with high prediction accuracy. However, they have not been widely used on large-scale data because of their prohibitive computational cost. In recent work, efforts have been made to reduce the cost, both by improving the scalability of the BMF algorithm as well as its implementation, but so far mainly separately. In this paper we show that the state-of-the-art of both approaches to scalability can be combined. We combine the recent highly-scalable Posterior Propagation algorithm for BMF, which parallelizes computation of blocks of the matrix, with a distributed BMF implementation that users asynchronous communication within each block. We show that the combination of the two methods gives substantial improvements in the scalability of BMF on web-scale datasets, when the goal is to reduce the wall-clock time.


B-SCST: Bayesian Self-Critical Sequence Training for Image Captioning

arXiv.org Machine Learning

Bayesian deep neural networks (DNN) provide a mathematically grounded framework to quantify uncertainty in their predictions. We propose a Bayesian variant of policy-gradient based reinforcement learning training technique for image captioning models to directly optimize non-differentiable image captioning quality metrics such as CIDEr-D. We extend the well-known Self-Critical Sequence Training (SCST) approach for image captioning models by incorporating Bayesian inference, and refer to it as B-SCST. The "baseline" reward for the policy-gradients in B-SCST is generated by averaging predictive quality metrics (CIDEr-D) of the captions drawn from the distribution obtained using a Bayesian DNN model. This predictive distribution is inferred using Monte Carlo (MC) dropout, which is one of the standard ways to approximate variational inference. We observe that B-SCST improves all the standard captioning quality scores on both Flickr30k and MS COCO datasets, compared to the SCST approach. We also provide a detailed study of uncertainty quantification for the predicted captions, and demonstrate that it correlates well with the CIDEr-D scores. To our knowledge, this is the first such analysis, and it can pave way to more practical image captioning solutions with interpretable models.


The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum Likelihood

arXiv.org Machine Learning

In this paper we consider the problem of computing the likelihood of the profile of a discrete distribution, i.e., the probability of observing the multiset of element frequencies, and computing a profile maximum likelihood (PML) distribution, i.e., a distribution with the maximum profile likelihood. For each problem we provide polynomial time algorithms that given $n$ i.i.d.\ samples from a discrete distribution, achieve an approximation factor of $\exp\left(-O(\sqrt{n} \log n) \right)$, improving upon the previous best-known bound achievable in polynomial time of $\exp(-O(n^{2/3} \log n))$ (Charikar, Shiragur and Sidford, 2019). Through the work of Acharya, Das, Orlitsky and Suresh (2016), this implies a polynomial time universal estimator for symmetric properties of discrete distributions in a broader range of error parameter. We achieve these results by providing new bounds on the quality of approximation of the Bethe and Sinkhorn permanents (Vontobel, 2012 and 2014). We show that each of these are $\exp(O(k \log(N/k)))$ approximations to the permanent of $N \times N$ matrices with non-negative rank at most $k$, improving upon the previous known bounds of $\exp(O(N))$. To obtain our results on PML, we exploit the fact that the PML objective is proportional to the permanent of a certain Vandermonde matrix with $\sqrt{n}$ distinct columns, i.e. with non-negative rank at most $\sqrt{n}$. As a by-product of our work we establish a surprising connection between the convex relaxation in prior work (CSS19) and the well-studied Bethe and Sinkhorn approximations.


Natural language processing for word sense disambiguation and information extraction

arXiv.org Artificial Intelligence

This research work deals with Natural Language Processing (NLP) and extraction of essential information in an explicit form. The most common among the information management strategies is Document Retrieval (DR) and Information Filtering. DR systems may work as combine harvesters, which bring back useful material from the vast fields of raw material. With large amount of potentially useful information in hand, an Information Extraction (IE) system can then transform the raw material by refining and reducing it to a germ of original text. A Document Retrieval system collects the relevant documents carrying the required information, from the repository of texts. An IE system then transforms them into information that is more readily digested and analyzed. It isolates relevant text fragments, extracts relevant information from the fragments, and then arranges together the targeted information in a coherent framework. The thesis presents a new approach for Word Sense Disambiguation using thesaurus. The illustrative examples supports the effectiveness of this approach for speedy and effective disambiguation. A Document Retrieval method, based on Fuzzy Logic has been described and its application is illustrated. A question-answering system describes the operation of information extraction from the retrieved text documents. The process of information extraction for answering a query is considerably simplified by using a Structured Description Language (SDL) which is based on cardinals of queries in the form of who, what, when, where and why. The thesis concludes with the presentation of a novel strategy based on Dempster-Shafer theory of evidential reasoning, for document retrieval and information extraction. This strategy permits relaxation of many limitations, which are inherent in Bayesian probabilistic approach.


A Bayesian approach for initialization of weights in backpropagation neural net with application to character recognition

arXiv.org Machine Learning

Convergence rate of training algorithms for neural networks is heavily affected by initialization of weights. In this paper, an original algorithm for initialization of weights in backpropagation neural net is presented with application to character recognition. The initialization method is mainly based on a customization of the Kalman filter, translating it into Bayesian statistics terms. A metrological approach is used in this context considering weights as measurements modeled by mutually dependent normal random variables. The algorithm performance is demonstrated by reporting and discussing results of simulation trials. Results are compared with random weights initialization and other methods. The proposed method shows an improved convergence rate for the backpropagation training algorithm.


Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

arXiv.org Artificial Intelligence

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made in the common objective, maximum likelihood estimation, for the CQR task. In CQR benchmarks of task-oriented dialogue systems, we evaluate fine-tuned PLMs on the recently-introduced CANARD dataset as an in-domain task and validate the models using data from the TREC 2019 CAsT Track as an out-domain task. Examining a variety of architectures with different numbers of parameters, we demonstrate that the recent text-to-text transfer transformer (T5) achieves the best results both on CANARD and CAsT with fewer parameters, compared to similar transformer architectures.


Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

arXiv.org Machine Learning

This study uses stacked generalization, which is a two-step process of combining machine learning methods, called meta or super learners, for improving the performance of algorithms in step one (by minimizing the error rate of each individual algorithm to reduce its bias in the learning set) and then in step two inputting the results into the meta learner with its stacked blended output (demonstrating improved performance with the weakest algorithms learning better). The method is essentially an enhanced cross-validation strategy. Although the process uses great computational resources, the resulting performance metrics on resampled fraud data show that increased system cost can be justified. A fundamental key to fraud data is that it is inherently not systematic and, as of yet, the optimal resampling methodology has not been identified. Building a test harness that accounts for all permutations of algorithm sample set pairs demonstrates that the complex, intrinsic data structures are all thoroughly tested. Using a comparative analysis on fraud data that applies stacked generalizations provides useful insight needed to find the optimal mathematical formula to be used for imbalanced fraud data sets.