Identifying Mislabeled Training Data
arXiv.org Artificial Intelligence
The goal of this approach is to improve the classification accuracy produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single-algorithm, majority vote, and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30%. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data, and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.

1. Introduction

The maximum accuracy achievable depends on the quality of the data and on the appropriateness of the chosen learning algorithm for the data. The work described here focuses on improving the quality of training data by identifying and eliminating mislabeled instances prior to applying the chosen learning algorithm, thereby increasing classification accuracy. Labeling errors can occur for several reasons, including subjectivity, data-entry error, or inadequacy of the information used to label each object. Subjectivity may arise when observations need to be ranked in some way, such as by disease severity, or when the information used to label an object is different from the information to which the learning algorithm will have access. For example, when labeling pixels in image data, the analyst typically uses visual input rather than the numeric values of the feature vector corresponding to the observation. Domains in which experts disagree are natural places for subjective labeling errors (Smyth, 1996).
A third cause of labeling error arises when the information used to label each observation is inadequate. For example, in the medical domain it may not be possible to perform the tests necessary to guarantee that a diagnosis is 100% accurate. For domains in which labeling errors occur, an automated method of eliminating or correcting mislabeled observations will improve the predictive accuracy of the classifier formed from the training data. In this article we address the problem of identifying training instances that are mislabeled.
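The filtering scheme summarized above can be sketched in code: a set of classifiers is trained on the folds that do not contain a given instance, and the instance is flagged as mislabeled when the out-of-fold classifiers dispute its label. This is a minimal illustrative sketch, not the paper's exact setup — the toy 1-D dataset, the three simple learners (1-NN, 3-NN, nearest centroid), and the fold assignment are all assumptions made for the example:

```python
from collections import Counter

def knn_factory(k):
    """Return a k-NN learner for 1-D points; ties broken by smaller x."""
    def learner(train, x):
        neighbors = sorted(train, key=lambda p: (abs(p[0] - x), p[0]))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]
    return learner

def nearest_centroid(train, x):
    """Predict the class whose mean feature value is closest to x."""
    classes = sorted({label for _, label in train})
    means = {c: sum(p for p, l in train if l == c) /
                sum(1 for _, l in train if l == c)
             for c in classes}
    return min(classes, key=lambda c: abs(means[c] - x))

def filter_mislabeled(data, learners, n_folds, scheme):
    """Flag instances whose label the out-of-fold classifiers dispute.

    scheme="consensus": every learner must misclassify the instance.
    scheme="majority":  more than half of the learners must misclassify it.
    """
    flagged = []
    for i, (x, y) in enumerate(data):
        # Train each filter classifier only on folds that exclude instance i.
        train = [data[j] for j in range(len(data)) if j % n_folds != i % n_folds]
        errors = sum(1 for learn in learners if learn(train, x) != y)
        if scheme == "consensus" and errors == len(learners):
            flagged.append(i)
        elif scheme == "majority" and errors > len(learners) / 2:
            flagged.append(i)
    return flagged

# Toy 1-D dataset: class 0 below 10, class 1 from 10 up,
# with two deliberately flipped (mislabeled) instances.
xs = list(range(20))
ys = [0] * 10 + [1] * 10
ys[3], ys[16] = 1, 0
data = list(zip(xs, ys))

learners = [knn_factory(1), knn_factory(3), nearest_centroid]
print(filter_mislabeled(data, learners, 4, "majority"))   # [3, 10, 16]
print(filter_mislabeled(data, learners, 4, "consensus"))  # [3, 16]
```

On this toy data the consensus filter flags only the two injected errors, while the majority filter also discards a clean boundary instance (x = 10) — mirroring the trade-off described above: majority filters catch more bad data at the cost of throwing away some good data.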
Jun-1-2011