Bayesian Inference
Smart PCA
Zhang, Yi (Carnegie Mellon University)
PCA can be smarter and make more sensible projections. In this paper, we propose smart PCA, an extension of standard PCA that regularizes model estimation and incorporates external knowledge into it. Based on the probabilistic interpretation of PCA, the inverse Wishart distribution can be used as the informative conjugate prior for the population covariance, and useful knowledge is carried by the prior hyperparameters. We design the hyperparameters to smoothly combine the information from both the domain knowledge and the data itself. The Bayesian point estimation of principal components is in closed form. In empirical studies, smart PCA shows clear improvement on three different criteria: image reconstruction errors, the perceptual quality of the reconstructed images, and the pattern recognition performance.
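The closed-form estimate mentioned above can be illustrated with a minimal sketch. It assumes zero-mean data and a generic inverse Wishart prior IW(Psi, nu); it does not reproduce the hyperparameter design of the paper. The function name `map_covariance_pca` and its arguments are hypothetical. Under these assumptions the posterior is IW(Psi + X^T X, nu + n), its mode is (Psi + X^T X) / (nu + n + p + 1), and the principal components are taken as the leading eigenvectors of that mode.

```python
import numpy as np

def map_covariance_pca(X, prior_scale, prior_dof, n_components):
    """PCA on the posterior-mode covariance under an inverse Wishart prior.

    Assumptions (not taken from the paper): rows of X are zero-mean samples,
    prior_scale is a (p, p) positive-definite matrix Psi encoding the
    external knowledge, and prior_dof is the prior degrees of freedom nu.
    """
    n, p = X.shape
    scatter = X.T @ X  # scatter matrix of zero-mean data
    # Mode of the conjugate posterior IW(Psi + scatter, nu + n).
    cov_map = (prior_scale + scatter) / (prior_dof + n + p + 1)
    # eigh returns eigenvalues in ascending order, so sort descending.
    eigvals, eigvecs = np.linalg.eigh(cov_map)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order].T, eigvals[order]

# Two samples vary along the first axis, but a strong prior on the
# second axis dominates the estimate.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
components, variances = map_covariance_pca(X, np.diag([0.1, 10.0]), 3.0, 1)
```

With little data the prior scale matrix dominates the projection; as n grows, the scatter term takes over and the result approaches standard PCA.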
Multi-Relational Learning with Gaussian Processes
Xu, Zhao (Fraunhofer IAIS) | Kersting, Kristian (Fraunhofer IAIS) | Tresp, Volker (Siemens Corporate Technology)
Due to their flexible nonparametric nature, Gaussian process models are very effective at solving hard machine learning problems. While existing Gaussian process models focus on modeling one single relation, we present a generalized GP model, named multi-relational Gaussian process model, that is able to deal with an arbitrary number of relations in a domain of interest. The proposed model is analyzed in the context of bipartite, directed, and undirected univariate relations. Experimental results on real-world datasets show that exploiting the correlations among different entity types and relations can indeed improve prediction performance.
Preference Learning with Extreme Examples
Wang, Fei (Florida International University) | Zhang, Bin (IBM CRL) | Li, Ta-Hsin (IBM T. J. Watson) | Yin, Wenjun (IBM CRL) | Dong, Jin (IBM CRL) | Li, Tao (Florida International University)
In this paper, we consider a general problem of semi-supervised preference learning, in which we are given some extreme cases and some ordering constraints, and our goal is to learn the unknown preferences of the remaining items. Taking the selection of potential housing sites as an example, we have many candidate places together with their associated information (e.g., position, environment); we know some extreme examples (i.e., several places are perfect for building a house, and several are so poor that a house cannot be built there); and we know some partial ordering constraints (i.e., for two places, which one is better). How, then, can we judge the preference of a potential place whose preference is unknown beforehand? We propose a Bayesian framework based on Gaussian processes to tackle this problem, from which we solve not only for the unknown preferences but also for the hyperparameters contained in our model.
Semi-Supervised Classification using Sparse Gaussian Process Regression
Patel, Amrish (Indian Institute of Science) | Sundararajan, S. (Yahoo! Labs) | Shevade, Shirish (Indian Institute of Science)
Gaussian Processes (GPs) are promising Bayesian methods for classification and regression problems. They have also been used for semi-supervised learning tasks. In this paper, we propose a new algorithm for solving semi-supervised binary classification problems using sparse GP regression (GPR) models. It is closely related to semi-supervised learning based on support vector regression (SVR) and maximum margin clustering. The proposed algorithm is simple and easy to implement. Unlike the SVR-based algorithm, it gives a sparse solution directly. Also, the hyperparameters are estimated easily without resorting to expensive cross-validation techniques. The use of a sparse GPR model helps make the proposed algorithm scalable. Preliminary results on synthetic and real-world data sets demonstrate the efficacy of the new algorithm.
Bayesian Extreme Components Analysis
Chen, Yutian (University of California, Irvine) | Welling, Max (University of California, Irvine)
Extreme Components Analysis (XCA) is a statistical method based on a single eigenvalue decomposition to recover the optimal combination of principal and minor components in the data. Unfortunately, minor components are notoriously sensitive to overfitting when the number of data items is small relative to the number of attributes. We present a Bayesian extension of XCA by introducing a conjugate prior for the parameters of the XCA model. This Bayesian-XCA is shown to outperform plain vanilla XCA as well as Bayesian-PCA and XCA based on a frequentist correction to the sample spectrum. Moreover, we show that minor components are only picked when they represent genuine constraints in the data, even for very small sample sizes. An extension to mixtures of Bayesian XCA models is also explored.
A New Bayesian Approach to Multiple Intermittent Fault Diagnosis
Abreu, Rui (Delft University of Technology) | Zoeteweij, Peter (Delft University of Technology) | Gemund, Arjan J.C. van (Delft University of Technology)
Logic reasoning approaches to fault diagnosis account for the fact that a component $c_j$ may fail intermittently by introducing a parameter $g_j$ that expresses the probability that the component exhibits correct behavior. This component parameter $g_j$, in conjunction with the a priori fault probability, is used in a Bayesian framework to compute the posterior fault candidate probabilities. Usually, information on $g_j$ is not known a priori. While proper estimation of $g_j$ can have a great impact on the diagnostic accuracy, at present, only approximations have been proposed. We present a novel framework, BARINEL, that computes exact estimations of $g_j$ as an integral part of the posterior candidate probability computation. BARINEL's diagnostic performance is evaluated for both synthetic and real software systems. Our results show that our approach is superior to approaches based on classical persistent fault models as well as previously proposed intermittent fault models.
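The Bayesian update described above can be sketched as follows. This is an illustration under stated assumptions, not the BARINEL algorithm: the values of g are fixed by hand rather than estimated, components are assumed to fail independently, and a run is assumed to pass only if every faulty component involved in it behaves correctly. The function name `candidate_posteriors` and the data layout are hypothetical.

```python
def candidate_posteriors(candidates, runs, g, prior_fault):
    """Posterior probability of each fault candidate given pass/fail runs.

    candidates  : list of frozensets of components assumed faulty
    runs        : list of (involved_components, passed) pairs
    g           : dict, probability that a faulty component behaves correctly
                  (fixed here by assumption, not estimated)
    prior_fault : dict, a priori fault probability of each component
    Assumes at least one candidate is consistent with the observations.
    """
    scores = []
    for candidate in candidates:
        # Prior: components are assumed to fail independently.
        score = 1.0
        for component, p in prior_fault.items():
            score *= p if component in candidate else 1.0 - p
        # Likelihood: a run passes only if every involved faulty
        # component happens to behave correctly.
        for involved, passed in runs:
            p_pass = 1.0
            for component in candidate & involved:
                p_pass *= g[component]
            score *= p_pass if passed else 1.0 - p_pass
        scores.append(score)
    total = sum(scores)
    return [s / total for s in scores]

candidates = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]
runs = [({"a"}, False), ({"b"}, True)]
posteriors = candidate_posteriors(
    candidates, runs, g={"a": 0.5, "b": 0.5}, prior_fault={"a": 0.1, "b": 0.1}
)
```

In this toy example the candidate {b} is ruled out, because the failing run does not involve b, and {a} ends up far more probable than {a, b}.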
Preference Functions That Score Rankings and Maximum Likelihood Estimation
Conitzer, Vincent (Duke University) | Rognlie, Matthew (Duke University) | Xia, Lirong (Duke University)
In social choice, a preference function (PF) takes a set of votes (linear orders over a set of alternatives) as input, and produces one or more rankings (also linear orders over the alternatives) as output. Such functions have many applications, for example, aggregating the preferences of multiple agents, or merging rankings (of, say, webpages) into a single ranking. The key issue is choosing a PF to use. One natural and previously studied approach is to assume that there is an unobserved "correct" ranking, and the votes are noisy estimates of this. Then, we can use the PF that always chooses the maximum likelihood estimate (MLE) of the correct ranking. In this paper, we define simple ranking scoring functions (SRSFs) and show that the class of neutral SRSFs is exactly the class of neutral PFs that are MLEs for some noise model. We also define composite ranking scoring functions (CRSFs) and show a condition under which these coincide with SRSFs. We study key properties such as consistency and continuity, and consider some example PFs. In particular, we study Single Transferable Vote (STV), a commonly used PF, showing that it is a CRSF but not an SRSF, thereby clarifying the extent to which it is an MLE function. This also gives a new perspective on how ties should be broken under STV. We leave some open questions.
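As an illustration of the MLE approach (the example is not taken from the paper), the Kemeny rule is a classical preference function that is the maximum likelihood estimator under a noise model in which each pairwise comparison in a vote agrees with the correct ranking independently with a fixed probability greater than 1/2. The brute-force sketch below returns all rankings that minimize the total Kendall tau distance to the votes; the function names are hypothetical, and the enumeration is only practical for a handful of alternatives.

```python
from itertools import combinations, permutations

def kendall_tau_distance(r1, r2):
    """Number of pairs of alternatives ordered differently by r1 and r2."""
    pos1 = {a: i for i, a in enumerate(r1)}
    pos2 = {a: i for i, a in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] < pos1[b]) != (pos2[a] < pos2[b])
    )

def kemeny_rankings(votes):
    """All rankings minimizing the summed Kendall tau distance to the votes.

    Each vote is a tuple listing the same alternatives from best to worst.
    Several rankings are returned when there is a tie.
    """
    alternatives = sorted(votes[0])
    scored = [
        (sum(kendall_tau_distance(candidate, vote) for vote in votes), candidate)
        for candidate in permutations(alternatives)
    ]
    best = min(score for score, _ in scored)
    return [candidate for score, candidate in scored if score == best]

votes = [("a", "b", "c"), ("a", "b", "c"), ("b", "a", "c")]
winners = kemeny_rankings(votes)
tied = kemeny_rankings([("a", "b"), ("b", "a")])
```

Returning a list rather than a single ranking mirrors the definition of a preference function as producing one or more rankings.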
Activity Recognition: Linking Low-Level Sensors to High-Level Intelligence
Yang, Qiang (Hong Kong University of Science and Technology)
Sensors provide computer systems with a window to the outside world. Activity recognition "sees" what is in the window to predict the locations, trajectories, actions, goals and plans of humans and objects. Building an activity recognition system requires a full range of interaction from statistical inference on lower level sensor data to symbolic AI at higher levels, where prediction results and acquired knowledge are passed up each level to form a knowledge food chain. In this article, I will give an overview of some of the current activity recognition research works and explore a life-cycle of learning and inference that allows the lowest-level radio-frequency signals to be transformed into symbolic logical representations for AI planning, which in turn controls the robots or guides human users through a sensor network, thus completing a full life-cycle of knowledge.
Conditional Probability Tree Estimation Analysis and Algorithms
Beygelzimer, Alina | Langford, John | Lifshits, Yuri | Sorkin, Gregory | Strehl, Alex
We consider the problem of estimating the conditional probability of a label in time $O(\log n)$, where $n$ is the number of possible labels. We analyze a natural reduction of this problem to a set of binary regression problems organized in a tree structure, proving a regret bound that scales with the depth of the tree. Motivated by this analysis, we propose the first online algorithm which provably constructs a logarithmic depth tree on the set of labels to solve this problem. We test the algorithm empirically, showing that it works successfully on a dataset with roughly $10^6$ labels.
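The tree reduction can be sketched as follows, assuming a fixed complete binary tree over $n = 2^d$ labels rather than the online tree construction proposed in the paper. The callable `p_right` stands in for the binary regressors and is hypothetical, as is the heap-style node numbering. The probability of a label is the product of the branch probabilities along its root-to-leaf path, so only $\log_2 n$ regressors are evaluated.

```python
def label_probability(label, n_labels, p_right, x):
    """Estimate P(label | x) with a complete binary tree over the labels.

    p_right(node, x) is a hypothetical binary regressor returning the
    probability of taking the right branch at internal node `node`.
    Nodes are numbered heap-style: root is 1, children of k are 2k and 2k+1.
    Assumes n_labels is a power of two and 0 <= label < n_labels.
    """
    if n_labels < 1 or n_labels & (n_labels - 1):
        raise ValueError("n_labels must be a power of two")
    depth = n_labels.bit_length() - 1
    node, prob = 1, 1.0
    # The bits of the label, most significant first, spell out the path.
    for level in range(depth - 1, -1, -1):
        go_right = (label >> level) & 1
        q = p_right(node, x)
        prob *= q if go_right else 1.0 - q
        node = 2 * node + go_right
    return prob

# Toy regressor that ignores x and depends only on the node index.
toy = lambda node, x: 1.0 / (node + 1)
probs = [label_probability(k, 4, toy, x=None) for k in range(4)]
```

Because each internal node splits its probability mass between its two children, the estimates over all labels sum to one regardless of the regressors used.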
Characterizing predictable classes of processes
The problem is sequence prediction in the following setting. A sequence $x_1,...,x_n,...$ of discrete-valued observations is generated according to some unknown probabilistic law (measure) $\mu$. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure $\mu$ belongs to an arbitrary class $\mathcal{C}$ of stochastic processes. We are interested in predictors $\rho$ whose conditional probabilities converge to the "true" $\mu$-conditional probabilities if any $\mu\in\mathcal{C}$ is chosen to generate the data. We show that if such a predictor exists, then a predictor can also be obtained as a convex combination of countably many elements of $\mathcal{C}$. In other words, it can be obtained as a Bayesian predictor whose prior is concentrated on a countable set. This result is established for two very different measures of performance of prediction, one of which is very strong, namely, total variation, and the other is very weak, namely, prediction in expected average Kullback-Leibler divergence.
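A Bayesian predictor of this kind can be sketched for a toy case. The sketch assumes a finite set of i.i.d. Bernoulli measures (a truncation of a countable set) with parameters strictly between 0 and 1; the function name `mixture_predict` and the chosen parameters are hypothetical. The predictive probability of the next symbol is the posterior-weighted average of the individual predictions.

```python
def mixture_predict(history, thetas, prior):
    """Probability that the next binary observation is 1 under a Bayesian mixture.

    history : observed sequence of 0s and 1s
    thetas  : Bernoulli parameters of the component measures, each in (0, 1)
    prior   : positive prior weights of the components, summing to 1
    """
    weights = []
    for theta, w in zip(thetas, prior):
        likelihood = 1.0
        for x in history:
            likelihood *= theta if x == 1 else 1.0 - theta
        # Unnormalized posterior weight: prior times likelihood.
        weights.append(w * likelihood)
    total = sum(weights)
    # Convex combination of the component predictions.
    return sum(w / total * theta for w, theta in zip(weights, thetas))

before = mixture_predict([], thetas=[0.2, 0.8], prior=[0.5, 0.5])
after = mixture_predict([1, 1, 1], thetas=[0.2, 0.8], prior=[0.5, 0.5])
```

After three observed ones, the posterior weight shifts toward the component with parameter 0.8, and the prediction moves from the prior average toward that value.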