Performance Analysis
PAC Generalization Bounds for Co-training
Dasgupta, Sanjoy, Littman, Michael L., McAllester, David A.
The rule-based bootstrapping introduced by Yarowsky, and its co-training variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences -- partial rules and partial labeling of the unlabeled data -- and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of
Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Mozer, Michael C., Dodier, Robert, Colagrosso, Michael D., Guerra-Salcedo, Cesar, Wolniewicz, Richard
When designing a two-alternative classifier, one ordinarily aims to maximize the classifier's ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance.
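To make the constraint concrete, the sketch below picks a decision threshold on classifier scores so that a required fraction of positive examples is accepted, then reports the correct rejection rate obtained at that threshold. This is plain post-hoc thresholding, shown only as a baseline illustration; it is not one of the four training algorithms the abstract refers to, and the function names, scores, and labels are hypothetical.

```python
import math


def threshold_for_acceptance_rate(scores, labels, target_rate):
    """Largest threshold at which at least target_rate of the positives score >= threshold."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must be accepted to meet the target rate.
    needed = max(1, math.ceil(target_rate * len(positives)))
    return positives[needed - 1]


def correct_rejection_rate(scores, labels, threshold):
    """Fraction of negative examples scoring strictly below the threshold."""
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s < threshold for s in negatives) / len(negatives)


# Hypothetical scores (higher = more likely to churn) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

threshold = threshold_for_acceptance_rate(scores, labels, target_rate=0.75)
rejection = correct_rejection_rate(scores, labels, threshold)
print(threshold, rejection)  # 0.6 0.75
```

Fixing the threshold after training satisfies the constraint but leaves the classifier itself unchanged, which is the gap that constraint-aware training is meant to close.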
SMOTE: Synthetic Minority Over-sampling Technique
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
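The abstract does not spell out how the synthetic examples are created. As commonly described, SMOTE places each synthetic point at a random position on the line segment between a minority example and one of its k nearest minority-class neighbours. The following is a minimal sketch under that assumption; the function name, the toy data, and the choice to draw base examples at random (rather than generating a fixed number per minority example) are simplifications made here for illustration.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority examples
    and their nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority examples by Euclidean distance to the base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
print(len(new_points))  # 6
```

Because every synthetic point lies between two existing minority examples, the minority region is filled in rather than duplicated, which is the usual argument for preferring this to over-sampling with replacement.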
Bayes Networks on Ice: Robotic Search for Antarctic Meteorites
Pedersen, Liam, Apostolopoulos, Dimitrios, Whittaker, William
Antarctica contains the most fertile meteorite hunting grounds on Earth. The pristine, dry and cold environment ensures that meteorites deposited there are preserved for long periods. Subsequent glacial flow of the ice sheets where they land concentrates them in particular areas. Most of the meteorites recovered to date were found in Antarctica in the last 20 years. Furthermore, they are less likely to be contaminated by terrestrial compounds.