Goto

Collaborating Authors

 Genre


Affine Invariant Divergences associated with Composite Scores and its Applications

arXiv.org Machine Learning

In statistical analysis, measuring a score of predictive performance is an important task. In many scientific fields, appropriate scores were tailored to tackle the problems at hand. A proper score is a popular tool to obtain statistically consistent forecasts. Furthermore, a mathematical characterization of the proper score was studied. As a result, it was revealed that the proper score corresponds to a Bregman divergence, which is an extension of the squared distance over the set of probability distributions. In the present paper, we introduce composite scores as an extension of the typical scores in order to obtain a wider class of probabilistic forecasting. Then, we propose a class of composite scores, named Holder scores, that induce equivariant estimators. The equivariant estimators have a favorable property, implying that the estimator is transformed in a consistent way, when the data is transformed. In particular, we deal with the affine transformation of the data. By using the equivariant estimators under the affine transformation, one can obtain estimators that do no essentially depend on the choice of the system of units in the measurement. Conversely, we prove that the Holder score is characterized by the invariance property under the affine transformations. Furthermore, we investigate statistical properties of the estimators using Holder scores for the statistical problems including estimation of regression functions and robust parameter estimation, and illustrate the usefulness of the newly introduced scores for statistical forecasting.


Towards a theory of good SAT representations

arXiv.org Artificial Intelligence

We aim at providing a foundation of a theory of "good" SAT representations F of boolean functions f. We argue that the hierarchy UC_k of unit-refutation complete clause-sets of level k, introduced by the authors, provides the most basic target classes, that is, F in UC_k is to be achieved for k as small as feasible. If F does not contain new variables, i.e., F is equivalent (as a CNF) to f, then F in UC_1 is similar to "achieving (generalised) arc consistency" known from the literature (it is somewhat weaker, but theoretically much nicer to handle). We show that for polysize representations of boolean functions in this sense, the hierarchy UC_k is strict. The boolean functions for these separations are "doped" minimally unsatisfiable clause-sets of deficiency 1; these functions have been introduced in [Sloan, Soerenyi, Turan, 2007], and we generalise their construction and show a correspondence to a strengthened notion of irredundant sub-clause-sets. Turning from lower bounds to upper bounds, we believe that many common CNF representations fit into the UC_k scheme, and we give some basic tools to construct representations in UC_1 with new variables, based on the Tseitin translation. Note that regarding new variables the UC_1-representations are stronger than mere "arc consistency", since the new variables are not excluded from consideration.


Revisiting Bayesian Blind Deconvolution

arXiv.org Machine Learning

Blind deconvolution involves the estimation of a sharp signal or image given only a blurry observation. Because this problem is fundamentally ill-posed, strong priors on both the sharp image and blur kernel are required to regularize the solution space. While this naturally leads to a standard MAP estimation framework, performance is compromised by unknown trade-off parameter settings, optimization heuristics, and convergence issues stemming from non-convexity and/or poor prior selections. To mitigate some of these problems, a number of authors have recently proposed substituting a variational Bayesian (VB) strategy that marginalizes over the high-dimensional image space leading to better estimates of the blur kernel. However, the underlying cost function now involves both integrals with no closed-form solution and complex, function-valued arguments, thus losing the transparency of MAP. Beyond standard Bayesian-inspired intuitions, it thus remains unclear by exactly what mechanism these methods are able to operate, rendering understanding, improvements and extensions more difficult. To elucidate these issues, we demonstrate that the VB methodology can be recast as an unconventional MAP problem with a very particular penalty/prior that couples the image, blur kernel, and noise level in a principled way. This unique penalty has a number of useful characteristics pertaining to relative concavity, local minima avoidance, and scale-invariance that allow us to rigorously explain the success of VB including its existing implementational heuristics and approximations. It also provides strict criteria for choosing the optimal image prior that, perhaps counter-intuitively, need not reflect the statistics of natural scenes. In so doing we challenge the prevailing notion of why VB is successful for blind deconvolution while providing a transparent platform for introducing enhancements.


A Rank Minrelation - Majrelation Coefficient

arXiv.org Machine Learning

Improving the detection of relevant variables using a new bivariate measure could importantly impact variable selection and large network inference methods. In this paper, we propose a new statistical coefficient that we call the rank minrelation coefficient. We define a minrelation of X to Y (or equivalently a majrelation of Y to X) as a measure that estimate p(Y > X) when X and Y are continuous random variables. The approach is similar to Lin's concordance coefficient that rather focuses on estimating p(X = Y). In other words, if a variable X exhibits a minrelation to Y then, as X increases, Y is likely to increases too. However, on the contrary to concordance or correlation, the minrelation is not symmetric. More explicitly, if X decreases, little can be said on Y values (except that the uncertainty on Y actually increases). In this paper, we formally define this new kind of bivariate dependencies and propose a new statistical coefficient in order to detect those dependencies. We show through several key examples that this new coefficient has many interesting properties in order to select relevant variables, in particular when compared to correlation.


Joint Topic Modeling and Factor Analysis of Textual Information and Graded Response Data

arXiv.org Machine Learning

Modern machine learning methods are critical to the development of large-scale personalized learning systems that cater directly to the needs of individual learners. The recently developed SPARse Factor Analysis (SPARFA) framework provides a new statistical model and algorithms for machine learning-based learning analytics, which estimate a learner's knowledge of the latent concepts underlying a domain, and content analytics, which estimate the relationships among a collection of questions and the latent concepts. SPARFA estimates these quantities given only the binary-valued graded responses to a collection of questions. In order to better interpret the estimated latent concepts, SPARFA relies on a post-processing step that utilizes user-defined tags (e.g., topics or keywords) available for each question. In this paper, we relax the need for user-defined tags by extending SPARFA to jointly process both graded learner responses and the text of each question and its associated answer(s) or other feedback. Our purely data-driven approach (i) enhances the interpretability of the estimated latent concepts without the need of explicitly generating a set of tags or performing a post-processing step, (ii) improves the prediction performance of SPARFA, and (iii) scales to large test/assessments where human annotation would prove burdensome. We demonstrate the efficacy of the proposed approach on two real educational datasets.


Generalized Bregman Divergence and Gradient of Mutual Information for Vector Poisson Channels

arXiv.org Machine Learning

We investigate connections between information-theoretic and estimation-theoretic quantities in vector Poisson channel models. In particular, we generalize the gradient of mutual information with respect to key system parameters from the scalar to the vector Poisson channel model. We also propose, as another contribution, a generalization of the classical Bregman divergence that offers a means to encapsulate under a unifying framework the gradient of mutual information results for scalar and vector Poisson and Gaussian channel models. The so-called generalized Bregman divergence is also shown to exhibit various properties akin to the properties of the classical version. The vector Poisson channel model is drawing considerable attention in view of its application in various domains: as an example, the availability of the gradient of mutual information can be used in conjunction with gradient descent methods to effect compressive-sensing projection designs in emerging X-ray and document classification applications.


Inferring Team Strengths Using a Discrete Markov Random Field

arXiv.org Machine Learning

We propose an original model for inferring team strengths using a Markov Random Field, which can be used to generate historical estimates of the offensive and defensive strengths of a team over time. This model was designed to be applied to sports such as soccer or hockey, in which contest outcomes take value in a limited discrete space. We perform inference using a combination of Expectation Maximization and Loopy Belief Propagation. The challenges of working with a non-convex optimization problem and a high-dimensional parameter space are discussed. The performance of the model is demonstrated on professional soccer data from the English Premier League.


Consistency of Online Random Forests

arXiv.org Machine Learning

As a testament to their success, the theory of random forests has long been outpaced by their application in practice. In this paper, we take a step towards narrowing this gap by providing a consistency result for online random forests.


Kernelized Bayesian Matrix Factorization

arXiv.org Machine Learning

We extend kernelized matrix factorization with a fully Bayesian treatment and with an ability to work with multiple side information sources expressed as different kernels. Kernel functions have been introduced to matrix factorization to integrate side information about the rows and columns (e.g., objects and users in recommender systems), which is necessary for making out-of-matrix (i.e., cold start) predictions. We discuss specifically bipartite graph inference, where the output matrix is binary, but extensions to more general matrices are straightforward. We extend the state of the art in two key aspects: (i) A fully conjugate probabilistic formulation of the kernelized matrix factorization problem enables an efficient variational approximation, whereas fully Bayesian treatments are not computationally feasible in the earlier approaches. (ii) Multiple side information sources are included, treated as different kernels in multiple kernel learning that additionally reveals which side information sources are informative. Our method outperforms alternatives in predicting drug-protein interactions on two data sets. We then show that our framework can also be used for solving multilabel learning problems by considering samples and labels as the two domains where matrix factorization operates on. Our algorithm obtains the lowest Hamming loss values on 10 out of 14 multilabel classification data sets compared to five state-of-the-art multilabel learning algorithms.


Minimizing inter-subject variability in fNIRS based Brain Computer Interfaces via multiple-kernel support vector learning

arXiv.org Machine Learning

Brain signal variability in the measurements obtained from different subjects during different sessions significantly deteriorates the accuracy of most brain-computer interface (BCI) systems. Moreover these variabilities, also known as inter-subject or inter-session variabilities, require lengthy calibration sessions before the BCI system can be used. Furthermore, the calibration session has to be repeated for each subject independently and before use of the BCI due to the inter-session variability. In this study, we present an algorithm in order to minimize the above-mentioned variabilities and to overcome the time-consuming and usually error-prone calibration time. Our algorithm is based on linear programming support-vector machines and their extensions to a multiple kernel learning framework. We tackle the inter-subject or -session variability in the feature spaces of the classifiers. This is done by incorporating each subject- or session-specific feature spaces into much richer feature spaces with a set of optimal decision boundaries. Each decision boundary represents the subject- or a session specific spatio-temporal variabilities of neural signals. Consequently, a single classifier with multiple feature spaces will generalize well to new unseen test patterns even without the calibration steps. We demonstrate that classifiers maintain good performances even under the presence of a large degree of BCI variability. The present study analyzes BCI variability related to oxy-hemoglobin neural signals measured using a functional near-infrared spectroscopy.