Identifying Mislabeled Training Data
arXiv.org Artificial Intelligence
The goal of this approach is to improve the classification accuracy produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single-algorithm, majority vote, and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30%. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data, and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.

1. Introduction

The maximum accuracy achievable depends on the quality of the data and on the appropriateness of the chosen learning algorithm for the data. The work described here focuses on improving the quality of training data by identifying and eliminating mislabeled instances prior to applying the chosen learning algorithm, thereby increasing classification accuracy. Labeling errors can occur for several reasons, including subjectivity, data-entry error, or inadequacy of the information used to label each object. Subjectivity may arise when observations need to be ranked in some way, such as by disease severity, or when the information used to label an object is different from the information to which the learning algorithm will have access. For example, when labeling pixels in image data, the analyst typically uses visual input rather than the numeric values of the feature vector corresponding to the observation. Domains in which experts disagree are natural places for subjective labeling errors (Smyth, 1996).
A third cause of labeling error arises when the information used to label each observation is inadequate. For example, in the medical domain it may not be possible to perform the tests necessary to guarantee that a diagnosis is 100% accurate. For domains in which labeling errors occur, an automated method of eliminating or correcting mislabeled observations will improve the predictive accuracy of the classifier formed from the training data. In this article we address the problem of identifying training instances that are mislabeled.
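The filtering scheme summarized above can be sketched in code: a set of classifiers is trained on the folds that do not contain a given instance, and the instance is flagged as mislabeled when the out-of-fold classifiers dispute its label. This is a minimal illustrative sketch, not the paper's exact setup — the toy 1-D dataset, the three simple learners (1-NN, 3-NN, nearest centroid), and the fold assignment are all assumptions made for the example:

```python
from collections import Counter

def knn_factory(k):
    """Return a k-NN learner for 1-D points; ties broken by smaller x."""
    def learner(train, x):
        neighbors = sorted(train, key=lambda p: (abs(p[0] - x), p[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]
    return learner

def nearest_centroid(train, x):
    """Predict the class whose mean feature value is closest to x."""
    classes = sorted({label for _, label in train})
    means = {c: sum(p for p, l in train if l == c) /
                sum(1 for _, l in train if l == c)
             for c in classes}
    return min(classes, key=lambda c: abs(means[c] - x))

def filter_mislabeled(data, learners, n_folds, scheme):
    """Flag instances whose label the out-of-fold classifiers dispute.

    scheme="consensus": every learner must misclassify the instance.
    scheme="majority":  more than half of the learners must misclassify it.
    """
    flagged = []
    for i, (x, y) in enumerate(data):
        # Train each filter classifier only on folds that exclude instance i.
        train = [data[j] for j in range(len(data)) if j % n_folds != i % n_folds]
        errors = sum(1 for learn in learners if learn(train, x) != y)
        if scheme == "consensus" and errors == len(learners):
            flagged.append(i)
        elif scheme == "majority" and errors > len(learners) / 2:
            flagged.append(i)
    return flagged

# Toy 1-D dataset: class 0 below 10, class 1 from 10 up,
# with two deliberately flipped (mislabeled) instances.
xs = list(range(20))
ys = [0] * 10 + [1] * 10
ys[3], ys[16] = 1, 0
data = list(zip(xs, ys))

learners = [knn_factory(1), knn_factory(3), nearest_centroid]
print(filter_mislabeled(data, learners, 4, "majority"))   # [3, 10, 16]
print(filter_mislabeled(data, learners, 4, "consensus"))  # [3, 16]
```

On this toy data the consensus filter flags only the two injected errors, while the majority filter also discards a clean boundary instance (x = 10) — mirroring the trade-off described above: majority filters catch more bad data at the cost of throwing away some good data.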
Jun-1-2011