AITopics | minority-class example

Lin

AAAI ConferencesFeb-8-2022, 09:26:21 GMT

Machine learning in real-world high-skew domains is difficult, because traditional strategies for crowdsourcing labeled training examples are ineffective at locating the scarce minority-class examples. For example, both random sampling and traditional active learning (which reduces to random sampling when just starting) will most likely recover very few minority-class examples. To bootstrap the machine learning process, researchers have proposed tasking the crowd with finding or generating minority-class examples, but such strategies have their weaknesses as well. They are unnecessarily expensive in well-balanced domains, and they often yield samples from a biased distribution that is unrepresentative of the one being learned.This paper extends the traditional active learning framework by investigating the problem of intelligently switching between various crowdsourcing strategies for obtaining labeled training examples in order to optimally train a classifier. We start by analyzing several such strategies (e.g., annotate an example, generate a minority-class example, etc.), and then develop a novel, skew-robust algorithm, called MB-CB, for the control problem. Experiments show that our method outperforms state-of-the-art GL-Hybrid by up to 14.3 points in F1 AUC, across various domains and class-frequency settings.

lin, minority-class example, training example

AAAI Conferences

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

Provost, F., Weiss, G. M.

arXiv.org Artificial IntelligenceJun-22-2011

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.

artificial intelligence, inductive learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1613/jair.1199

1106.4557

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > California > San Mateo County > Menlo Park (0.04)
North America > United States > New York > New York County > New York City (0.04)
(7 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction

Weiss, G. M., Provost, F.

Journal of Artificial Intelligence ResearchOct-1-2003

For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the training examples and/or the computational costs associated with learning from them. In such circumstances, one question of practical importance is: if only n training examples can be selected, in what proportion should the classes be represented? In this article we help to answer this question by analyzing, for a fixed training-set size, the relationship between the class distribution of the training data and the performance of classification trees induced from these data. We study twenty-six data sets and, for each, determine the best class distribution for learning. The naturally occurring class distribution is shown to generally perform well when classifier performance is evaluated using undifferentiated error rate (0/1 loss). However, when the area under the ROC curve is used to evaluate classifier performance, a balanced distribution is shown to perform well. Since neither of these choices for class distribution always generates the best-performing classifier, we introduce a "budget-sensitive" progressive sampling algorithm for selecting training examples based on the class associated with each example. An empirical analysis of this algorithm shows that the class distribution of the resulting training set yields classifiers with good (nearly-optimal) classification performance.

class distribution, error rate, minority-class example, (15 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.1199

AI Access Foundation

10346

Journal of Artificial Intelligence Research

Country: