A key issue in supervised protein classification is the representation of input sequencesof amino acids. Recent work using string kernels for protein datahas achieved state-of-the-art classification performance. However, suchrepresentations are based only on labeled data -- examples with known 3D structures, organized into structural classes -- while in practice, unlabeled data is far more plentiful.
We propose Rademacher complexity bounds for multiclass classifiers trained with a two-step semi-supervised model. In the first step, the algorithm partitions the partially labeled data and then identifies dense clusters containing $\kappa$ predominant classes using the labeled training examples such that the proportion of their non-predominant classes is below a fixed threshold. In the second step, a classifier is trained by minimizing a margin empirical loss over the labeled training set and a penalization term measuring the disability of the learner to predict the $\kappa$ predominant classes of the identified clusters. The resulting data-dependent generalization error bound involves the margin distribution of the classifier, the stability of the clustering technique used in the first step and Rademacher complexity terms corresponding to partially labeled training data. Our theoretical result exhibit convergence rates extending those proposed in the literature for the binary case, and experimental results on different multiclass classification problems show empirical evidence that supports the theory.
In this paper, we introduce a package for semi-supervised learning research in the R programming language called RSSL. We cover the purpose of the package, the methods it includes and comment on their use and implementation. We then show, using several code examples, how the package can be used to replicate well-known results from the semi-supervised learning literature.
Current word sense disambiguation (WSD) systems based on supervised learning are still limited in that they do not work well for all words in a language. One of the main reasons is the lack of sufficient training data. In this paper, we investigate the use of unlabeled training data for WSD, in the framework of semi-supervised learning. Four semisupervised learning algorithms are evaluated on 29 nouns of Senseval-2 (SE2) English lexical sample task and SE2 English all-words task. Empirical results show that unlabeled data can bring significant improvement in WSD accuracy.
Semi-supervised learning methods are motivated by the relative paucity of labeled data and aim to utilize large sources of unlabeled data to improve predictive tasks. It has been noted, however, such improvements are not guaranteed in general in some cases the unlabeled data impairs the performance. A fundamental source of error comes from restrictive assumptions about the unlabeled features. In this paper, we develop a semi-supervised learning approach that relaxes such assumptions and is robust with respect to labels missing at random. The approach ensures that uncertainty about the classes is propagated to the unlabeled features in a robust manner. It is applicable using any generative model with associated learning algorithm. We illustrate the approach using both standard synthetic data examples and the MNIST data with unlabeled adversarial examples.