The Sample Complexity of Semi-Supervised Learning with Nonparametric Mixture Models

Neural Information Processing Systems

We study the sample complexity of semi-supervised learning (SSL) and introduce new assumptions based on the mismatch between a mixture model learned from unlabeled data and the true mixture model induced by the (unknown) class conditional distributions. Under these assumptions, we establish an $\Omega(K\log K)$ labeled sample complexity bound without imposing parametric assumptions, where $K$ is the number of classes. Our results suggest that even in nonparametric settings it is possible to learn a near-optimal classifier using only a few labeled samples. Unlike previous theoretical work which focuses on binary classification, we consider general multiclass classification ($K>2$), which requires solving a difficult permutation learning problem. This permutation defines a classifier whose classification error is controlled by the Wasserstein distance between mixing measures, and we provide finite-sample results characterizing the behaviour of the excess risk of this classifier. Finally, we describe three algorithms for computing these estimators based on a connection to bipartite graph matching, and perform experiments to illustrate the superiority of the MLE over the majority vote estimator.


On Defending Against Label Flipping Attacks on Malware Detection Systems

arXiv.org Artificial Intelligence

Label manipulation attacks are a subclass of data poisoning attacks in adversarial machine learning used against different applications, such as malware detection. These types of attacks represent a serious threat to detection systems in environments having high noise rate or uncertainty, such as complex networks and Internet of Thing (IoT). Recent work in the literature has suggested using the $K$-Nearest Neighboring (KNN) algorithm to defend against such attacks. However, such an approach can suffer from low to wrong detection accuracy. In this paper, we design an architecture to tackle the Android malware detection problem in IoT systems. We develop an attack mechanism based on Silhouette clustering method, modified for mobile Android platforms. We proposed two Convolutional Neural Network (CNN)-type deep learning algorithms against this \emph{Silhouette Clustering-based Label Flipping Attack (SCLFA)}. We show the effectiveness of these two defense algorithms - \emph{Label-based Semi-supervised Defense (LSD)} and \emph{clustering-based Semi-supervised Defense (CSD)} - in correcting labels being attacked. We evaluate the performance of the proposed algorithms by varying the various machine learning parameters on three Android datasets: Drebin, Contagio, and Genome and three types of features: API, intent, and permission. Our evaluation shows that using random forest feature selection and varying ratios of features can result in an improvement of up to 19\% accuracy when compared with the state-of-the-art method in the literature.


Lautum Regularization for Semi-supervised Transfer Learning

arXiv.org Machine Learning

Transfer learning is a very important tool in deep learning as it allows propagating information from one "source dataset" to another "target dataset", especially in the case of a small number of training examples in the latter. Yet, discrepancies between the underlying distributions of the source and target data are commonplace and are known to have a substantial impact on algorithm performance. In this work we suggest a novel information theoretic approach for the analysis of the performance of deep neural networks in the context of transfer learning. We focus on the task of semi-supervised transfer learning, in which unlabeled samples from the target dataset are available during the network training on the source dataset. Our theory suggests that one may improve the transferability of a deep neural network by imposing a Lautum information based regularization that relates the network weights to the target data. We demonstrate in various transfer learning experiments the effectiveness of the proposed approach.


Semi-Supervised Classification for oil reservoir

arXiv.org Machine Learning

This paper addresses the general problem of accurate identification of oil reservoirs. Recent improvements in well or borehole logging technology have resulted in an explosive amount of data available for processing. The traditional methods of analysis of the logs characteristics by experts require significant amount of time and money and is no longer practicable. In this paper, we use the semi-supervised learning to solve the problem of ever-increasing amount of unlabelled data available for interpretation. The experts are needed to label only a small amount of the log data. The neural network classifier is first trained with the initial labelled data. Next, batches of unlabelled data are being classified and the samples with the very high class probabilities are being used in the next training session, bootstrapping the classifier. The process of training, classifying, enhancing the labelled data is repeated iteratively until the stopping criteria are met, that is, no more high probability samples are found. We make an empirical study on the well data from Jianghan oil field and test the performance of the neural network semi-supervised classifier. We compare this method with other classifiers. The comparison results show that our neural network semi-supervised classifier is superior to other classification methods.


Robust Semi-Supervised Learning when Labels are Missing at Random

arXiv.org Machine Learning

Semi-supervised learning methods are motivated by the relative paucity of labeled data and aim to utilize large sources of unlabeled data to improve predictive tasks. It has been noted, however, such improvements are not guaranteed in general in some cases the unlabeled data impairs the performance. A fundamental source of error comes from restrictive assumptions about the unlabeled features. In this paper, we develop a semi-supervised learning approach that relaxes such assumptions and is robust with respect to labels missing at random. The approach ensures that uncertainty about the classes is propagated to the unlabeled features in a robust manner. It is applicable using any generative model with associated learning algorithm. We illustrate the approach using both standard synthetic data examples and the MNIST data with unlabeled adversarial examples.