A Robust Ensemble Approach to Learn From Positive and Unlabeled Data Using SVM Base Models
Claesen, Marc, De Smet, Frank, Suykens, Johan A. K., De Moor, Bart
We present a novel approach to learning binary classifiers when only positive and unlabeled instances are available (PU learning). This problem is routinely cast as a supervised task with label noise in the negative set. We use an ensemble of SVM models trained on bootstrap resamples of the training data for increased robustness against label noise. The approach can be considered in a bagging framework, which provides an intuitive explanation for its mechanics in a semi-supervised setting. We compared our method to state-of-the-art approaches in simulations using multiple public benchmark data sets. The included benchmark comprises three settings with increasing label noise: (i) fully supervised, (ii) PU learning and (iii) PU learning with false positives. Our approach shows a marginal improvement over existing methods in the second setting and a significant improvement in the third.

Frank De Smet is a member of the medical management department of the National Alliance of Christian Mutualities.

Accepted at Neurocomputing: SI on Advances in Learning with Label Noise, 20/10/2014.

1. Introduction

Training binary classifiers on positive and unlabeled data is referred to as PU learning [31]. The absence of known negative training instances warrants appropriate learning methods. Inaccurate label information can be more problematic than attribute noise [45]. Specialised PU learning approaches are recommended when (i) negative labels cannot be acquired, (ii) the training data contains a large number of false negatives or (iii) the positive set has many outliers. Practical applications of PU learning typically feature large, imbalanced training sets with a small number of labeled (positive) and a large number of unlabeled training instances.
The PU learning problem arises in various settings, including web page classification [44], intrusion detection [26] and bioinformatics tasks such as variant prioritization [42], gene prioritization [1, 35] and virtual screening of drug compounds [41]. Though these applications share a common underlying learning problem, the final evaluation criteria may be fundamentally different.
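The ensemble idea summarized in the abstract (SVM base models trained on bootstrap resamples, with the unlabeled set treated as noisy negatives) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the base model is a linear SVM fitted by plain batch subgradient descent, base models are combined by averaging their decision values, and the function names, resample sizes and toy data are hypothetical.

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, n_iter=200):
    """Fit (w, b) by batch subgradient descent on the L2-regularized hinge loss.
    Illustrative stand-in for an SVM solver; hyperparameters are assumptions."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # only margin violators contribute to the hinge subgradient
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision_value(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_pu_ensemble(positives, unlabeled, n_models=15, n_pos=20, n_unl=20, seed=0):
    """Train one linear SVM per bootstrap resample of the positive and unlabeled sets.
    Unlabeled instances are labeled -1, i.e. treated as negatives with label noise."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        pos_sample = rng.choices(positives, k=n_pos)  # sampling with replacement
        unl_sample = rng.choices(unlabeled, k=n_unl)
        X = pos_sample + unl_sample
        y = [1] * n_pos + [-1] * n_unl
        models.append(train_linear_svm(X, y))
    return models

def ensemble_score(models, x):
    """Aggregate by averaging base-model decision values (an assumed aggregation rule)."""
    return sum(decision_value(m, x) for m in models) / len(models)

# Toy data (hypothetical): positives cluster around (2, 2); the unlabeled set
# mixes hidden positives with true negatives around (-2, -2).
data_rng = random.Random(42)

def cluster(cx, cy, n):
    return [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)) for _ in range(n)]

positives = cluster(2, 2, 20)
unlabeled = cluster(2, 2, 10) + cluster(-2, -2, 30)

models = train_pu_ensemble(positives, unlabeled)
scores = [ensemble_score(models, x) for x in [(3.0, 3.0), (-3.0, -3.0)]]
```

In this sketch each base model sees a different mix of hidden positives in its "negative" sample, so averaging over resamples dampens the effect of any single contaminated draw. A point deep in the positive region should therefore receive a higher aggregated score than one deep in the negative region.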
Oct-21-2014