 Unsupervised or Indirectly Supervised Learning


CURL: Co-trained Unsupervised Representation Learning for Image Classification

arXiv.org Machine Learning

Abstract: In this paper we propose a strategy for semi-supervised image classification that leverages unsupervised representation learning and co-training. The strategy, called CURL (Co-trained Unsupervised Representation Learning), iteratively builds two classifiers on two different views of the data. The two views correspond to different representations learned from both labeled and unlabeled data and differ in the fusion scheme used to combine the image features. To assess the performance of our proposal, we conducted several experiments on widely used data sets for scene and object recognition. We considered three scenarios (inductive, transductive, and self-taught learning) that differ in the strategy followed to exploit the unlabeled data. As image features we considered a combination of GIST, PHOG, and LBP as well as features extracted from a Convolutional Neural Network. Moreover, two embodiments of CURL are investigated: one using Ensemble Projection as unsupervised representation learning coupled with Logistic Regression, and one based on LapSVM. The results show that CURL clearly outperforms other supervised and semi-supervised learning methods in the state of the art.

Semi-supervised learning [1] consists in taking into account both labeled and unlabeled data when training machine learning models. It is particularly effective when there is plenty of training data, but only a few instances are labeled. In recent years, many semi-supervised learning approaches have been proposed, including generative methods [2], [3], graph-based methods [4], [5], and methods based on Support Vector Machines [6], [7]. Co-training is another example of a semi-supervised technique [8].
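The co-training loop that the abstract describes (two classifiers built iteratively on two views of the same data) can be illustrated with a minimal sketch. This is generic co-training, not the CURL implementation: a nearest-centroid classifier stands in for the Logistic Regression and LapSVM models named in the abstract, and the function names (`centroids`, `predict`, `co_train`) and the parameters `rounds` and `per_round` are assumptions introduced here for illustration.

```python
import math


def centroids(X, y):
    """Mean feature vector per class (stand-in classifier, an assumption)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(cents, x):
    """Return (label, margin): nearest centroid and the gap to the runner-up."""
    dists = sorted((math.dist(x, c), label) for label, c in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else 0.0
    return dists[0][1], margin


def co_train(view_a, view_b, labels, rounds=5, per_round=1):
    """Generic co-training over two views of the same samples.

    labels[i] is a class label, or None for an unlabeled sample.
    In each round, the classifier of each view labels the unlabeled
    samples it is most confident about; those labels join the shared
    pool, so the other view's classifier trains on them next.
    """
    labels = list(labels)  # work on a copy, leave the input untouched
    for _ in range(rounds):
        for view in (view_a, view_b):
            known = [i for i, l in enumerate(labels) if l is not None]
            unknown = [i for i, l in enumerate(labels) if l is None]
            if not unknown:
                return labels
            cents = centroids([view[i] for i in known],
                              [labels[i] for i in known])
            scored = sorted(((predict(cents, view[i]), i) for i in unknown),
                            key=lambda t: -t[0][1])
            for (label, _), i in scored[:per_round]:
                labels[i] = label
    return labels
```

Calling `co_train(view_a, view_b, labels)` with two lists of feature vectors (one per view, aligned by sample index) returns a completed label list. The confidence measure used here, the distance gap between the two nearest centroids, is one simple choice among many.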


Semi-described and semi-supervised learning with Gaussian processes

arXiv.org Machine Learning

Propagating input uncertainty through non-linear Gaussian process (GP) mappings is intractable. This hinders the task of training GPs using uncertain and partially observed inputs. In this paper we refer to this task as "semi-described learning". We then introduce a GP framework that solves both the semi-described and the semi-supervised learning problems (in the latter, missing values occur in the outputs). Auto-regressive state space simulation is also recognised as a special case of semi-described learning. To achieve our goal we develop variational methods for handling semi-described inputs in GPs, and couple them with algorithms that allow for imputing the missing values while treating the uncertainty in a principled, Bayesian manner. Extensive experiments on simulated and real-world data study the problems of iterative forecasting and regression/classification with missing values. The results suggest that the principled propagation of uncertainty stemming from our framework can significantly improve performance in these tasks.


Unsupervised Learning in Genome Informatics

arXiv.org Machine Learning

With different genomes available, unsupervised learning algorithms are essential for learning genome-wide biological insights. In particular, the functional characterization of different genomes is essential for understanding life. In this book chapter, we review the state-of-the-art unsupervised learning algorithms for genome informatics from DNA to microRNA. DNA (deoxyribonucleic acid) is the basic component of genomes. A significant fraction of DNA regions (transcription factor binding sites) are bound by proteins (transcription factors) to regulate gene expression at different developmental stages in different tissues. To fully understand genetics, it is necessary for us to apply unsupervised learning algorithms to learn and infer those DNA regions. Here we review several unsupervised learning methods for deciphering the genome-wide patterns of those DNA regions. MicroRNAs (miRNAs), a class of small endogenous non-coding RNA (ribonucleic acid) species, regulate gene expression post-transcriptionally by forming imperfect base pairs with target sites, primarily in the 3′ untranslated regions of messenger RNAs. Since the 1993 discovery of the first miRNA let-7 in worms, a vast number of studies have been dedicated to characterizing the functional impact of miRNAs in a network context to understand complex diseases such as cancer. Here we review several representative unsupervised learning frameworks for inferring miRNA regulatory networks. These frameworks exploit the static sequence-based information pertinent to prior knowledge of miRNA targeting, together with the dynamic information on miRNA activity implied by recently available large data compendia, which interrogate genome-wide expression profiles of miRNAs and/or mRNAs across various cell conditions.


Regularized Multi-Task Learning for Multi-Dimensional Log-Density Gradient Estimation

arXiv.org Machine Learning

Multi-task learning is a paradigm of machine learning for solving multiple related learning tasks simultaneously with the expectation that information brought by other related tasks can be mutually exploited to improve the accuracy [Caruana, 1997]. Multi-task learning is particularly useful when one has many related learning tasks to solve but only a few training samples are available for each task, which is often the case in real-world problems such as therapy screening [Bickel et al., 2008] and face verification [Wang et al., 2009]. Multi-task learning has been gathering a great deal of attention, and extensive studies have been conducted both theoretically and experimentally [Thrun, 1996, Evgeniou and Pontil, 2004, Ando and Zhang, 2005, Zhang, 2013, Baxter, 2000]. Thrun [1996] proposed the lifelong learning framework, which transfers the knowledge obtained from the tasks experienced in the past to a newly given task, and it was demonstrated to improve the performance of image recognition. Baxter [2000] defined a multi-task learning framework called inductive bias learning, and derived a generalization error bound. The semi-supervised multi-task learning method proposed by Ando and Zhang [2005] generates many auxiliary learning tasks from unlabeled data and seeks a good feature mapping for the target learning task.


Graph Construction for Semi-Supervised Learning

AAAI Conferences

Semi-Supervised Learning (SSL) techniques have become very relevant since they require only a small set of labeled data. In this scenario, graph-based SSL algorithms provide a powerful framework for modeling manifold structures in high-dimensional spaces and are effective for propagating the few initial labels present in the training data through the graph. An important step in graph-based SSL methods is the conversion of tabular data into a weighted graph. Graph construction plays a key role in the quality of the classification in graph-based methods. Nevertheless, most of the SSL literature focuses on developing label inference algorithms without studying graph construction methods and their effect on the performance of the base algorithm. This PhD project aims to study this issue, proposing new methods for graph construction from flat data that improve the performance of graph-based algorithms.
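The two steps named above, converting tabular data into a weighted graph and propagating the few initial labels through it, can be sketched as follows. This is a textbook baseline (a symmetric k-nearest-neighbour graph with Gaussian weights, followed by clamped label propagation), not one of the construction methods the project proposes. The function names and the parameters `k`, `sigma`, and `iters` are assumptions chosen for illustration.

```python
import math


def knn_graph(X, k=2, sigma=1.0):
    """Symmetric k-nearest-neighbour graph with Gaussian edge weights.

    Returns a dict-of-dicts adjacency: W[i][j] is the weight of edge i-j.
    """
    n = len(X)
    W = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((math.dist(X[i], X[j]), j)
                         for j in range(n) if j != i)[:k]
        for d, j in nearest:
            w = math.exp(-d * d / (2 * sigma ** 2))
            # Symmetrise: keep an edge if either endpoint selected it.
            W[i][j] = w
            W[j][i] = w
    return W


def propagate(W, labels, classes, iters=50):
    """Clamped label propagation over the weighted graph W.

    labels[i] is a class or None. Each unlabeled node repeatedly takes
    the weighted average of its neighbours' class scores, while labeled
    nodes keep their initial one-hot scores.
    """
    n = len(labels)
    F = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    for _ in range(iters):
        new_F = []
        for i in range(n):
            total = sum(W[i].values())
            if labels[i] is not None or total == 0:
                new_F.append(F[i])  # clamp labeled or isolated nodes
                continue
            new_F.append([sum(w * F[j][c] for j, w in W[i].items()) / total
                          for c in range(len(classes))])
        F = new_F
    return [classes[max(range(len(classes)), key=lambda c: F[i][c])]
            for i in range(n)]
```

The choice of `k` and `sigma` changes which points end up connected and how strongly, which is exactly the sensitivity to graph construction that the abstract highlights.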


Unsupervised Learning of an IS-A Taxonomy from a Limited Domain-Specific Corpus

AAAI Conferences

Taxonomies hierarchically organize concepts in a domain. Building and maintaining them by hand is a tedious and time-consuming task. This paper proposes a novel, unsupervised algorithm for automatically learning an IS-A taxonomy from scratch by analyzing a given text corpus. Our approach is designed to deal with infrequently occurring concepts, so it can effectively induce taxonomies even from small corpora. Algorithmically, the approach makes two important contributions. First, it performs inference based on clustering and the distributional semantics, which can capture links among concepts never mentioned together. Second, it uses a novel graph-based algorithm to detect and remove incorrect is-a relations from a taxonomy. An empirical evaluation on five corpora demonstrates the utility of our proposed approach.


Optimally Combining Classifiers Using Unlabeled Data

arXiv.org Machine Learning

We develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifiers can perform significantly better than any single classifier.


The Boundary Forest Algorithm for Online Supervised and Unsupervised Learning

arXiv.org Machine Learning

We describe a new instance-based learning algorithm called the Boundary Forest (BF) algorithm, which can be used for supervised and unsupervised learning. The algorithm builds a forest of trees whose nodes store previously seen examples. It can be shown data points one at a time and updates itself incrementally, hence it is naturally online. Few instance-based algorithms have this property while also being fast, which the BF is. This is crucial for applications where one needs to respond to input data in real time. The number of children of each node is not set beforehand but obtained from the training procedure, which makes the algorithm very flexible with regard to the data manifolds it can learn. We test its generalization performance and speed on a range of benchmark datasets and detail the settings in which it outperforms the state of the art. Empirically we find that training time scales as O(DN log N) and testing time as O(D log N), where D is the dimensionality and N the amount of data.
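The idea of a tree whose nodes store previously seen examples and which grows one data point at a time can be sketched with a single tree, as below. The abstract does not spell out the traversal or insertion rules, so both are assumptions here: a query descends greedily to the closest of the current node and its children, and a training example is stored only when the node it reaches predicts a different label. A forest would combine several such trees. The class and method names are hypothetical.

```python
import math


class Node:
    """A stored training example with an open-ended list of children."""

    def __init__(self, x, label):
        self.x = x
        self.label = label
        self.children = []


class BoundaryTree:
    """Single-tree sketch; traversal and insertion rules are assumptions."""

    def __init__(self):
        self.root = None
        self.size = 0

    def _closest(self, x):
        # Greedy descent: move to the nearest child until the current
        # node is at least as close to x as any of its children.
        node = self.root
        while True:
            best = min([node] + node.children,
                       key=lambda n: math.dist(x, n.x))
            if best is node:
                return node
            node = best

    def query(self, x):
        """Predict the label of the stored example reached from the root."""
        return self._closest(x).label

    def train(self, x, label):
        """Online update: store x only if the tree currently mislabels it."""
        if self.root is None:
            self.root = Node(x, label)
            self.size = 1
            return
        node = self._closest(x)
        if node.label != label:
            node.children.append(Node(x, label))
            self.size += 1
```

Because an example is stored only when the current prediction is wrong, the stored nodes tend to concentrate near class boundaries, and the number of children per node is determined by the data rather than fixed in advance.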


Accuracy of Latent-Variable Estimation in Bayesian Semi-Supervised Learning

arXiv.org Machine Learning

Hierarchical probabilistic models, such as Gaussian mixture models, are widely used for unsupervised learning tasks. These models consist of observable and latent variables, which represent the observable data and the underlying data-generation process, respectively. Unsupervised learning tasks, such as cluster analysis, are regarded as estimations of latent variables based on the observable ones. The estimation of latent variables in semi-supervised learning, where some labels are observed, will be more precise than in unsupervised learning, and one of the concerns is to clarify the effect of the labeled data. However, there has not been sufficient theoretical analysis of the accuracy of the estimation of latent variables. In a previous study, a distribution-based error function was formulated, and its asymptotic form was calculated for unsupervised learning with generative models. It has been shown that, for the estimation of latent variables, the Bayes method is more accurate than the maximum-likelihood method. The present paper reveals the asymptotic forms of the error function in Bayesian semi-supervised learning for both discriminative and generative models. The results show that the generative model, which uses all of the given data, performs better when the model is well specified.


Noise-Robust Semi-Supervised Learning by Large-Scale Sparse Coding

AAAI Conferences

This paper presents a large-scale sparse coding algorithm to deal with the challenging problem of noise-robust semi-supervised learning over very large data with only a few noisy initial labels. By giving an L1-norm formulation of Laplacian regularization directly based upon the manifold structure of the data, we transform noise-robust semi-supervised learning into a generalized sparse coding problem so that noise reduction can be imposed upon the noisy initial labels. Furthermore, to keep noise-robust semi-supervised learning scalable over very large data, we make use of both nonlinear approximation and dimension reduction techniques to solve this generalized sparse coding problem in linear time and space complexity. Finally, we evaluate the proposed algorithm on the challenging task of large-scale semi-supervised image classification with only a few noisy initial labels. The experimental results on several benchmark image datasets show the promising performance of the proposed algorithm.