Learning Graphical Models
Asymptotic consistency and order specification for logistic classifier chains in multi-label learning
Machine Learning manuscript No. (will be inserted by the editor)Asymptotic consistency and order specification for logistic classifier chains in multi-label learning Paweł T eisseyre Received: date / Accepted: date Abstract Classifier chains are popular and effective method to tackle a multi-label classification problem. The aim of this paper is to study the asymptotic properties of the chain model in which the conditional probabilities are of the logistic form. In particular we find conditions on the number of labels and the distribution of feature vector under which the estimated mode of the joint distribution of labels converges to the true mode. Best of our knowledge, this important issue has not yet been studied in the context of multi-label learning. We also investigate how the order of model building in a chain influences the estimation of the joint distribution of labels. We establish the link between the problem of incorrect ordering in the chain and incorrect model specification. We propose a procedure of determining the optimal ordering of labels in the chain, which is based on using measures of correct specification and allows to find the ordering such that the consecutive logistic models are best possibly specified. The other important question raised in this paper is how accurately can we estimate the joint posterior probability when the ordering of labels is wrong or the logistic models in the chain are incorrectly specified. The numerical experiments illustrate the theoretical results. Keywords classifier chains· logistic regression· joint mode estimation· label ordering· asymptotic consistency 1 Introduction In multi-label classification the task is to automatically assign an object to multiple categories based on its characteristics. Each object of our interest is described by a feature vector x belonging to p-dimensional space and vector of K labels y ( y 1,..., y K)′ . In this paper we consider binary labels such thaty k 1 indicates that the considered object belongs to k-th category or has the k-th property. The issue has recently attracted significant attention, motivated by an increasing number of applications such as image and video annotationPaweł Teisseyre Institute of Computer Science, Polish Academy of Sciences Jana Kazimierza 5 01-248 Warsaw, Poland Tel.: 48-22-380-05-55 Email: teisseyrep@ipipan.waw.pl
Feature ranking for multi-label classification using Markov Networks
We propose a simple and efficient method for ranking features in multi-label classification. The method produces a ranking of features showing their relevance in predicting labels, which in turn allows to choose a final subset of features. The procedure is based on Markov Networks and allows to model the dependencies between labels and features in a direct way. In the first step we build a simple network using only labels and then we test how much adding a single feature affects the initial network. More specifically, in the first step we use the Ising model whereas the second step is based on the score statistic, which allows to test a significance of added features very quickly. The proposed approach does not require transformation of label space, gives interpretable results and allows for attractive visualization of dependency structure. We give a theoretical justification of the procedure by discussing some theoretical properties of the Ising model and the score statistic. Numerical experiments show that the proposed methods outperform the conventional approaches on the considered artificial and real datasets. Introduction Multi-label classification (MLC) has recently attracted a significant attention, motivated by an increasing number of applications. More examples can be found in [22], [23] and [24]. The key problem in multi-label learning is how to utilize label dependencies to improve the classification performance, motivated by which number of multi-label algorithms have been proposed in recent years (see [25] for extensive comparison of several methods). The recent progress in MLC is summarized in [26] and [22]. In MLC, each object of our interest (e.g. One of the trending challenges in MLC is a dimensionality reduction of the feature space [22], i.e. reducing the dimensionality of the vector x. Usually only some features affect y.
Max-Margin Nonparametric Latent Feature Models for Link Prediction
Zhu, Jun, Song, Jiaming, Chen, Bei
Link prediction is a fundamental task in statistical network analysis. Recent advances have been made on learning flexible nonparametric Bayesian latent feature models for link prediction. In this paper, we present a max-margin learning method for such nonparametric latent feature relational models. Our approach attempts to unite the ideas of max-margin learning and Bayesian nonparametrics to discover discriminative latent features for link prediction. It inherits the advances of nonparametric Bayesian methods to infer the unknown latent social dimension, while for discriminative link prediction, it adopts the max-margin learning principle by minimizing a hinge-loss using the linear expectation operator, without dealing with a highly nonlinear link likelihood function. For posterior inference, we develop an efficient stochastic variational inference algorithm under a truncated mean-field assumption. Our methods can scale up to large-scale real networks with millions of entities and tens of millions of positive links. We also provide a full Bayesian formulation, which can avoid tuning regularization hyper-parameters. Experimental results on a diverse range of real datasets demonstrate the benefits inherited from max-margin learning and Bayesian nonparametric inference.
Recurrent Gaussian Processes
Mattos, César Lincoln C., Dai, Zhenwen, Damianou, Andreas, Forth, Jeremy, Barreto, Guilherme A., Lawrence, Neil D.
We define Recurrent Gaussian Processes (RGP) models, a general family of Bayesian nonparametric models with recurrent GP priors which are able to learn dynamical patterns from sequential data. Similar to Recurrent Neural Networks (RNNs), RGPs can have different formulations for their internal states, distinct inference methods and be extended with deep structures. In such context, we propose a novel deep RGP model whose autoregressive states are latent, thereby performing representation and dynamical learning simultaneously. To fully exploit the Bayesian nature of the RGP model we develop the Recurrent Variational Bayes (REVARB) framework, which enables efficient inference and strong regularization through coherent propagation of uncertainty across the RGP layers and states. We also introduce a RGP extension where variational parameters are greatly reduced by being reparametrized through RNN-based sequential recognition models. We apply our model to the tasks of nonlinear system identification and human motion modeling. The promising obtained results indicate that our RGP model maintains its highly flexibility while being able to avoid overfitting and being applicable even when larger datasets are not available.
Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures
Bernardi, Raffaella, Cakici, Ruket, Elliott, Desmond, Erdem, Aykut, Erdem, Erkut, Ikizler-Cinbis, Nazli, Keller, Frank, Muscat, Adrian, Plank, Barbara
Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either generation problem or as a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions.
Dynamic Filtering of Time-Varying Sparse Signals via l1 Minimization
Charles, Adam, Balavoine, Aurele, Rozell, Christopher
Despite the importance of sparsity signal models and the increasing prevalence of high-dimensional streaming data, there are relatively few algorithms for dynamic filtering of time-varying sparse signals. Of the existing algorithms, fewer still provide strong performance guarantees. This paper examines two algorithms for dynamic filtering of sparse signals that are based on efficient l1 optimization methods. We first present an analysis for one simple algorithm (BPDN-DF) that works well when the system dynamics are known exactly. We then introduce a novel second algorithm (RWL1-DF) that is more computationally complex than BPDN-DF but performs better in practice, especially in the case where the system dynamics model is inaccurate. Robustness to model inaccuracy is achieved by using a hierarchical probabilistic data model and propagating higher-order statistics from the previous estimate (akin to Kalman filtering) in the sparse inference process. We demonstrate the properties of these algorithms on both simulated data as well as natural video sequences. Taken together, the algorithms presented in this paper represent the first strong performance analysis of dynamic filtering algorithms for time-varying sparse signals as well as state-of-the-art performance in this emerging application.
Learning to classify with possible sensor failures
Xie, Tianpei, Nasrabadi, Nasser M., Hero, Alfred O.
Large margin classifiers, such as the support vector machine (SVM) [1] and the maximum entropy discrimination (MED) classifier [2], have enjoyed great popularity in the signal processing and machine learning communities due to their broad applicability, robust performance, and the availability of fast software implementations. When the training data is representative of the test data, the performance of MED/SVM has theoretical guarantees that have been validated in practice [1], [3], [4]. Moreover, since the decision boundary of the MED/SVM is solely defined by a few support vectors, the algorithm can tolerate random feature distortions and perturbations. However, in many real applications, anomalous measurements are inherent to the data set due to strong environmental noise or possible sensor failures. Such anomalies arise in industrial process monitoring, video surveillance, tactical multi-modal sensing, and, more generally, any application that involves unattended sensors in difficult environments (Figure 1).
Statistical Mechanics of High-Dimensional Inference
To model modern large-scale datasets, we need efficient algorithms to infer a set of $P$ unknown model parameters from $N$ noisy measurements. What are fundamental limits on the accuracy of parameter inference, given finite signal-to-noise ratios, limited measurements, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density $\alpha = \frac{N}{P}\rightarrow \infty$. However, these classical results are not relevant to modern high-dimensional inference problems, which instead occur at finite $\alpha$. We formulate and analyze high-dimensional inference as a problem in the statistical physics of quenched disorder. Our analysis uncovers fundamental limits on the accuracy of inference in high dimensions, and reveals that widely cherished inference algorithms like maximum likelihood (ML) and maximum-a posteriori (MAP) inference cannot achieve these limits. We further find optimal, computationally tractable algorithms that can achieve these limits. Intriguingly, in high dimensions, these optimal algorithms become computationally simpler than MAP and ML, while still outperforming them. For example, such optimal algorithms can lead to as much as a 20% reduction in the amount of data to achieve the same performance relative to MAP. Moreover, our analysis reveals simple relations between optimal high dimensional inference and low dimensional scalar Bayesian inference, insights into the nature of generalization and predictive power in high dimensions, information theoretic limits on compressed sensing, phase transitions in quadratic inference, and connections to central mathematical objects in convex optimization theory and random matrix theory.
Efficient functional ANOVA through wavelet-domain Markov groves
We introduce a wavelet-domain functional analysis of variance (fANOVA) method based on a Bayesian hierarchical model. The factor effects are modeled through a spike-and-slab mixture at each location-scale combination along with a normal-inverse-Gamma (NIG) conjugate setup for the coefficients and errors. A graphical model called the Markov grove (MG) is designed to jointly model the spike-and-slab statuses at all location-scale combinations, which incorporates the clustering of each factor effect in the wavelet-domain thereby allowing borrowing of strength across location and scale. The posterior of this NIG-MG model is analytically available through a pyramid algorithm of the same computational complexity as Mallat's pyramid algorithm for discrete wavelet transform, i.e., linear in both the number of observations and the number of locations. Posterior probabilities of factor contributions can also be computed through pyramid recursion, and exact samples from the posterior can be drawn without MCMC. We investigate the performance of our method through extensive simulation and show that it outperforms existing wavelet-domain fANOVA methods in a variety of common settings. We apply the method to analyzing the orthosis data.
Stratified Bayesian Optimization
Toscano-Palmerin, Saul, Frazier, Peter I.
We suppose that f has no special structural properties, e.g., concavity, or linearity, that we can exploit to solve this problem, making it a "black blox." We also suppose that evaluating f is costly or time-consuming, making these evaluations "expensive", severely limiting the number of evaluations we may perform. This typically occurs because each evaluation requires running a complex PDE-based or discrete-event simulation, or requires training a machine learning algorithm on a large dataset. When f comes from a discrete-event simulation, this problem is also called "simulation optimization." Bayesian optimization is a popular class of techniques for solving this problem, originating with the seminal paper (Kushner, 1964), and enjoying early contributions from (Mockus et al., 1978; Mockus, 1989). This class of techniques was popularized in the 1990s by the introduction in (Jones et al., 1998) of the most well-known Bayesian optimization method, Efficient Global Optimization (EGO), relying on earlier ideas from (Mockus, 1989). Recently the machine learning community has devoted considerable attention to Bayesian optimization for its applications to tuning computationally intensive machine learning models, as in, e.g., (Snoek et al., 2012). Textbooks and surveys on Bayesian optimization include (Forrester et al., 2008; Brochu et al., 2010). Most work on Bayesian optimization assumes we can observe the objective function directly without noise, but a substantial number of papers, e.g.