SMOTE: Synthetic Minority Over-sampling Technique
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P.
arXiv.org Artificial Intelligence
An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
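The abstract says only that synthetic minority class examples are created and combined with under-sampling of the majority class; it does not spell out the procedure. The sketch below is one plausible reading, assuming the commonly described SMOTE step of interpolating between a minority example and one of its k nearest minority neighbours. The function names (`smote_sketch`, `undersample`), the parameters (`k`, `keep_fraction`, `seed`) and the toy data are all hypothetical, chosen for illustration rather than taken from the paper.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours.

    Assumption: each synthetic point lies on the line segment between a
    randomly chosen minority example and one of its k nearest minority
    neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def undersample(majority, keep_fraction, seed=0):
    """Randomly keep a fraction of the majority class (plain under-sampling)."""
    rng = random.Random(seed)
    n_keep = max(1, round(keep_fraction * len(majority)))
    return rng.sample(majority, n_keep)


# Toy data: 4 minority points on the unit square, 25 majority points far away.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(x), float(y)) for x in range(5, 10) for y in range(5, 10)]

new_points = smote_sketch(minority, n_synthetic=8, k=3)
kept = undersample(majority, keep_fraction=0.4)
balanced_minority = minority + new_points
```

Because each synthetic point is a convex combination of two existing minority points, it stays inside the region the minority class already occupies instead of duplicating an existing example. The resulting `balanced_minority` and `kept` sets would then be passed to whichever classifier is being trained.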
Jun-9-2011
- Country:
- North America > United States
- Nebraska > Lancaster County
- Lincoln (0.04)
- Nevada > Clark County
- Las Vegas (0.04)
- Arizona > Maricopa County
- Phoenix (0.04)
- Florida > Hillsborough County
- Tampa (0.04)
- Oregon > Multnomah County
- Portland (0.04)
- Wisconsin > Dane County
- Madison (0.04)
- Indiana > St. Joseph County
- Notre Dame (0.04)
- California
- San Francisco County > San Francisco (0.14)
- San Mateo County > San Mateo (0.04)
- San Diego County > San Diego (0.04)
- Orange County > Irvine (0.04)
- Alameda County > Livermore (0.04)
- New York > New York County
- New York City (0.04)
- Europe
- United Kingdom > England
- Cambridgeshire > Cambridge (0.04)
- Switzerland > Zürich
- Zürich (0.04)
- Italy > Apulia
- Bari (0.04)
- Germany > Saxony-Anhalt
- Magdeburg (0.04)
- Belgium > Flanders
- Flemish Brabant > Leuven (0.04)
- Asia > India
- Maharashtra > Mumbai (0.04)
- Genre:
- Research Report (0.64)