Statistical Learning
Multi-label ensemble based on variable pairwise constraint projection
Multi-label classification has attracted an increasing amount of attention in recent years. To this end, many algorithms have been developed to classify multi-label data in an effective manner. However, they usually do not consider the pairwise relations indicated by sample labels, which actually play important roles in multi-label classification. Inspired by this, we naturally extend the traditional pairwise constraints to the multi-label scenario via a flexible thresholding scheme. Moreover, to improve the generalization ability of the classifier, we adopt a boosting-like strategy to construct a multi-label ensemble from a group of base classifiers. To achieve these goals, this paper presents a novel multi-label classification framework named Variable Pairwise Constraint projection for Multi-label Ensemble (VPCME). Specifically, we take advantage of the variable pairwise constraint projection to learn a lower-dimensional data representation, which preserves the correlations between samples and labels. Thereafter, the base classifiers are trained in the new data space. For the boosting-like strategy, we employ both the variable pairwise constraints and the bootstrap steps to diversify the base classifiers. Empirical studies have shown the superiority of the proposed method in comparison with other approaches.
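To make the idea of thresholded pairwise constraints concrete, the following is a minimal sketch and not the VPCME procedure itself. It assumes that label similarity is measured with the Jaccard index and that two thresholds (`must_thr`, `cannot_thr`) decide which sample pairs become must-link or cannot-link constraints; the function names, the similarity measure and the threshold values are all hypothetical choices made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two label sets (1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_constraints(label_sets, must_thr=0.5, cannot_thr=0.0):
    """Split sample pairs into must-link and cannot-link constraints.

    Assumption (not from the abstract): a pair is must-link when its label
    similarity is >= must_thr and cannot-link when it is <= cannot_thr.
    Pairs that fall between the two thresholds carry no constraint.
    """
    must, cannot = [], []
    for i, j in combinations(range(len(label_sets)), 2):
        sim = jaccard(label_sets[i], label_sets[j])
        if sim >= must_thr:
            must.append((i, j))
        elif sim <= cannot_thr:
            cannot.append((i, j))
    return must, cannot


# Toy multi-label data: each sample carries a set of labels.
labels = [{"sports", "news"}, {"sports"}, {"music"}, {"news", "sports", "tv"}]
must, cannot = pairwise_constraints(labels)
```

Varying the two thresholds changes which pairs are constrained, which is one plausible way to obtain different projections for different base classifiers.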
Compressive Nonparametric Graphical Model Selection For Time Series
Jung, Alexander, Heckel, Reinhard, Bölcskei, Helmut, Hlawatsch, Franz
We propose a method for inferring the conditional independence graph (CIG) of a high-dimensional discrete-time Gaussian vector random process from finite-length observations. Our approach does not rely on a parametric model (such as, e.g., an autoregressive model) for the vector random process; rather, it only assumes certain spectral smoothness properties. The proposed inference scheme is compressive in that it works for sample sizes that are (much) smaller than the number of scalar process components. We provide analytical conditions for our method to correctly identify the CIG with high probability. Here, h[m] is a nonnegative weight function that typically increases with m. The CIG of the process x[n] is the graph G := (V, E) with node set V = [p] := {1, ..., p} representing the scalar component processes {x
Fast Distribution To Real Regression
Oliva, Junier B., Neiswanger, Willie, Poczos, Barnabas, Schneider, Jeff, Xing, Eric
We study the problem of distribution to real-value regression, where one aims to regress a mapping $f$ that takes in a distribution input covariate $P\in \mathcal{I}$ (for a non-parametric family of distributions $\mathcal{I}$) and outputs a real-valued response $Y=f(P) + \epsilon$. This setting was recently studied, and a "Kernel-Kernel" estimator was introduced and shown to have a polynomial rate of convergence. However, evaluating a new prediction with the Kernel-Kernel estimator scales as $\Omega(N)$. This causes the difficult situation where a large amount of data may be necessary for a low estimation risk, but the computation cost of estimation becomes infeasible when the data-set is too large. To this end, we propose the Double-Basis estimator, which looks to alleviate this big data problem in two ways: first, the Double-Basis estimator is shown to have a computational complexity that is independent of the number of instances $N$ when evaluating new predictions after training; second, the Double-Basis estimator is shown to have a fast rate of convergence for a general class of mappings $f\in\mathcal{F}$.
FuSSO: Functional Shrinkage and Selection Operator
Oliva, Junier B., Poczos, Barnabas, Verstynen, Timothy, Singh, Aarti, Schneider, Jeff, Yeh, Fang-Cheng, Tseng, Wen-Yih
We present the FuSSO, a functional analogue to the LASSO, that efficiently finds a sparse set of functional input covariates to regress a real-valued response against. The FuSSO does so in a semi-parametric fashion, making no parametric assumptions about the nature of input functional covariates and assuming a linear form to the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via proof of asymptotic sparsistency under various conditions. Furthermore, we observe good results on both synthetic and real-world data.
Becoming More Robust to Label Noise with Classifier Diversity
Smith, Michael R., Martinez, Tony
It is widely known in the machine learning community that class noise can be (and often is) detrimental to inducing a model of the data. Many current approaches use a single, often biased, measurement to determine if an instance is noisy. A biased measure may work well on certain data sets, but it can also be less effective on a broader set of data sets. In this paper, we present noise identification using classifier diversity (NICD) -- a method for deriving a less biased noise measurement and integrating it into the learning process. To lessen the bias of the noise measure, NICD selects a diverse set of classifiers (based on their predictions of novel instances) to determine which instances are noisy. We examine NICD as a technique for filtering, instance weighting, and selecting the base classifiers of a voting ensemble. We compare NICD with several other noise handling techniques that do not consider classifier diversity on a set of 54 data sets and 5 learning algorithms. NICD significantly increases the classification accuracy over the other considered approaches and is effective across a broad set of data sets and learning algorithms.
Collaborative Filtering with Information-Rich and Information-Sparse Entities
Zhu, Kai, Wu, Rui, Ying, Lei, Srikant, R.
In this paper, we consider a popular model for collaborative filtering in recommender systems where some users of a website rate some items, such as movies, and the goal is to recover the ratings of some or all of the unrated items of each user. In particular, we consider both the clustering model, where only users (or items) are clustered, and the co-clustering model, where both users and items are clustered, and further, we assume that some users rate many items (information-rich users) and some users rate only a few items (information-sparse users). When users (or items) are clustered, our algorithm can recover the rating matrix with $\omega(MK \log M)$ noisy entries while $MK$ entries are necessary, where $K$ is the number of clusters and $M$ is the number of items. In the case of co-clustering, we prove that $K^2$ entries are necessary for recovering the rating matrix, and our algorithm achieves this lower bound within a logarithmic factor when $K$ is sufficiently large. We compare our algorithms with a well-known algorithm called alternating minimization (AM), and a similarity score-based algorithm known as the popularity-among-friends (PAF) algorithm by applying all three to the MovieLens and Netflix data sets. Our co-clustering algorithm and AM have similar overall error rates when recovering the rating matrix, both of which are lower than the error rate under PAF. More importantly, the error rate of our co-clustering algorithm is significantly lower than those of AM and PAF in the scenarios of interest in recommender systems: when recommending a few items to each user or when recommending items to users who only rated a few items (these users are the majority of the total user population). The performance difference increases even more when noise is added to the datasets.
Retrieval of Experiments with Sequential Dirichlet Process Mixtures in Model Space
Dutta, Ritabrata, Seth, Sohan, Kaski, Samuel
We address the problem of retrieving relevant experiments given a query experiment, motivated by the public databases of datasets in molecular biology and other experimental sciences, and the need of scientists to relate to earlier work on the level of actual measurement data. Since experiments are inherently noisy and databases ever accumulating, we argue that a retrieval engine should possess two particular characteristics. First, it should compare models learnt from the experiments rather than the raw measurements themselves: this allows incorporating experiment-specific prior knowledge to suppress noise effects and focus on what is important. Second, it should be updated sequentially from newly published experiments, without explicitly storing either the measurements or the models, which is critical for saving storage space and protecting data privacy: this promotes lifelong learning. We formulate the retrieval as a "supermodelling" problem, of sequentially learning a model of the set of posterior distributions, represented as sets of MCMC samples, and suggest the use of Particle-Learning-based sequential Dirichlet process mixture (DPM) for this purpose. The relevance measure for retrieval is derived from the supermodel through the mixture representation. We demonstrate the performance of the proposed retrieval method on simulated data and molecular biological experiments.
Nonlinear hyperspectral unmixing with robust nonnegative matrix factorization
Févotte, Cédric, Dobigeon, Nicolas
This paper introduces a robust mixing model to describe hyperspectral data resulting from the mixture of several pure spectral signatures. This new model not only generalizes the commonly used linear mixing model, but also allows for possible nonlinear effects to be easily handled, relying on mild assumptions regarding these nonlinearities. The standard nonnegativity and sum-to-one constraints inherent to spectral unmixing are coupled with a group-sparse constraint imposed on the nonlinearity component. The data fidelity term is expressed as a β-divergence, a continuous family of dissimilarity measures that takes the squared Euclidean distance and the generalized Kullback-Leibler divergence as special cases. The penalized objective is minimized with a block-coordinate descent that involves majorization-minimization updates. Simulation results obtained on synthetic and real data show that the proposed strategy competes with state-of-the-art linear and nonlinear unmixing methods. Spectral unmixing (SU) is an issue of prime interest when analyzing hyperspectral data since it provides a comprehensive and meaningful description of the collected measurements in various application fields including remote sensing [1], planetology [2], food monitoring [3] or spectro-microscopy [4]. Most of the hyperspectral unmixing algorithms proposed in the signal & image processing and geoscience literatures rely on the commonly admitted linear mixing model (LMM), $Y \approx MA$. Indeed, the LMM provides a good approximation of the physical process underlying the observations and has yielded interesting results for most applications. However, for several specific applications, the LMM may be inaccurate and other nonlinear models need to be advocated [7]. For instance, in remotely sensed images composed of vegetation (e.g., trees), interactions of photons with multiple components of the scene lead to nonlinear effects that can be taken into account
N. Dobigeon is with University of Toulouse, IRIT/INP-ENSEEIHT, 2 rue Camichel, BP 7122, 31071 Toulouse cedex 7, France.
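The β-divergence mentioned in the abstract above can be illustrated with a short sketch. The code below uses the standard scalar definition of the β-divergence and is not taken from the paper; the function names are hypothetical. Note that with this common normalization, β = 2 gives one half of the squared Euclidean distance, β = 1 gives the generalized Kullback-Leibler divergence, and β = 0 gives the Itakura-Saito divergence.

```python
import math


def beta_divergence(x, y, beta):
    """Scalar beta-divergence d_beta(x | y) for strictly positive x and y."""
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be strictly positive")
    if beta == 1:
        # Limit case: generalized Kullback-Leibler divergence.
        return x * math.log(x / y) - x + y
    if beta == 0:
        # Limit case: Itakura-Saito divergence.
        return x / y - math.log(x / y) - 1
    # General case; beta = 2 reduces to 0.5 * (x - y) ** 2.
    num = x ** beta + (beta - 1) * y ** beta - beta * x * y ** (beta - 1)
    return num / (beta * (beta - 1))


def total_beta_divergence(X, Y, beta):
    """Sum of entrywise beta-divergences between two equally shaped matrices
    given as lists of rows (e.g. data and its low-rank approximation)."""
    return sum(
        beta_divergence(x, y, beta)
        for row_x, row_y in zip(X, Y)
        for x, y in zip(row_x, row_y)
    )
```

In a data fidelity term, `total_beta_divergence` would be evaluated between the observed matrix and its model-based reconstruction; the choice of β then controls how large and small entries are weighted.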
Collaborative Representation for Classification, Sparse or Non-sparse?
Wu, Yang, Jarich, Vansteenberge, Mukunoki, Masayuki, Minoh, Michihiko
Sparse representation based classification (SRC) has proven to be a simple, effective and robust solution to face recognition. As it has grown popular, doubts about the necessity of enforcing sparsity have started to emerge, and initial experimental results showed that simply changing the $l_1$-norm based regularization to the computationally much more efficient $l_2$-norm based non-sparse version would lead to similar or even better performance. However, that is not always the case. Given a new classification task, it is still unclear which regularization strategy (i.e., making the coefficients sparse or non-sparse) is the better choice without trying both for comparison. In this paper, we present what is, as far as we know, the first study on solving this issue, based on a large number of diverse classification experiments. We propose a scoring function for pre-selecting the regularization strategy using only the dataset size, the feature dimensionality and a discrimination score derived from a given feature representation. Moreover, we show that when dictionary learning is taken into account, non-sparse representation has a more significant superiority over sparse representation. This work is expected to enrich our understanding of sparse/non-sparse collaborative representation for classification and motivate further research activities.
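For reference, the two regularization strategies contrasted in the abstract above are commonly written as follows. This is the usual textbook formulation rather than notation taken from the paper; the symbols $y$ (test sample), $D$ (dictionary whose columns are training samples), $D_c$ and $\hat{x}_c$ (columns and coefficients belonging to class $c$) and $\lambda$ (regularization weight) are assumptions introduced here, and the decision rule shown is the plain residual-based rule used in SRC.

```latex
\begin{align*}
\text{sparse ($l_1$):}\quad
  & \hat{x} = \arg\min_x \|y - Dx\|_2^2 + \lambda \|x\|_1 \\
\text{non-sparse ($l_2$):}\quad
  & \hat{x} = \arg\min_x \|y - Dx\|_2^2 + \lambda \|x\|_2^2
    = (D^\top D + \lambda I)^{-1} D^\top y \\
\text{decision:}\quad
  & \hat{c} = \arg\min_c \|y - D_c \hat{x}_c\|_2
\end{align*}
```

The closed-form solution in the $l_2$ case is what makes the non-sparse variant computationally much cheaper: the matrix $(D^\top D + \lambda I)^{-1} D^\top$ can be computed once and reused for every test sample, whereas the $l_1$ problem requires an iterative solver per sample.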
Discriminative Functional Connectivity Measures for Brain Decoding
Firat, Orhan, Ozay, Mete, Oztekin, Ilke, Vural, Fatos T. Yarman
We propose a statistical learning model for classifying cognitive processes based on distributed patterns of neural activation in the brain, acquired via functional magnetic resonance imaging (fMRI). In the proposed learning method, local meshes are formed around each voxel. The distance between voxels in the mesh is determined by using a functional neighbourhood concept. In order to define the functional neighbourhood, the similarities between the time series recorded for voxels are measured and functional connectivity matrices are constructed. Then, the local mesh for each voxel is formed by including the functionally closest neighbouring voxels in the mesh. The relationship between the voxels within a mesh is estimated by using a linear regression model. These relationship vectors, called Functional Connectivity aware Local Relational Features (FC-LRF), are then used to train a statistical learning machine. The proposed method was tested on a recognition memory experiment, including data pertaining to encoding and retrieval of words belonging to ten different semantic categories. Two popular classifiers, namely k-nearest neighbour (k-nn) and Support Vector Machine (SVM), are trained in order to predict the semantic category of the item being retrieved, based on activation patterns during encoding. The classification performance of the Functional Mesh Learning model, which ranges from 62% to 71%, is superior to that of the classical multi-voxel pattern analysis (MVPA) methods, which ranges from 40% to 48%, for ten semantic categories.
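The functional neighbourhood step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which similarity measure is used, so Pearson correlation between voxel time series is assumed here, and the function names and toy data are hypothetical. The sketch covers only the neighbour selection, not the subsequent linear regression that produces the FC-LRF features.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long time series
    (returns 0.0 if either series is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def functional_neighbours(series, k):
    """For each voxel, return the indices of the k other voxels whose time
    series are most strongly (positively) correlated with it."""
    n = len(series)
    # Functional connectivity matrix: pairwise similarity of time series.
    sim = [[pearson(series[i], series[j]) for j in range(n)] for i in range(n)]
    neighbours = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        neighbours.append(others[:k])
    return neighbours


# Toy time series for four voxels (rows) over four time points.
series = [
    [1.0, 2.0, 3.0, 4.0],  # voxel 0
    [2.0, 4.0, 6.1, 8.0],  # voxel 1: almost proportional to voxel 0
    [4.0, 3.0, 2.0, 1.0],  # voxel 2: anti-correlated with voxel 0
    [1.0, 3.0, 2.0, 4.0],  # voxel 3: loosely related to voxel 0
]
meshes = functional_neighbours(series, k=2)
```

Each entry of `meshes` lists the voxels that would form the local mesh around the corresponding seed voxel; unlike a spatial neighbourhood, these voxels need not be adjacent in the brain volume.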