The Use of Unlabeled Data versus Labeled Data for Stopping Active Learning for Text Classification
Garrett Beatty, Ethan Kochis, Michael Bloodgood
Abstract-- Annotation of training data is the major bottleneck in the creation of text classification systems. Active learning is a commonly used technique to reduce the amount of training data one needs to label. A crucial aspect of active learning is determining when to stop labeling data. Three potential sources of information for deciding when to stop active learning are an additional labeled set of data, an unlabeled set of data, and the training data that is labeled during the process of active learning. To date, no one has compared and contrasted the advantages and disadvantages of stopping methods based on these three information sources. We find that stopping methods that use unlabeled data are more effective than methods that use labeled data.

I. INTRODUCTION

Active learning has been used to reduce annotation costs when training machine learning models for text and speech processing applications [1], [2], [3], [4], [5]. It has been shown to have particularly large potential for reducing annotation cost in text classification [6], [7]. Text classification is one of the most important fields in semantic computing and has been used in many applications [8], [9], [10], [11], [12].
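The stopping problem described above can be illustrated with a toy pool-based active learner. The sketch below is a hypothetical illustration, not the paper's method: it uses a one-dimensional threshold classifier, uncertainty sampling, and a stopping rule that halts once predictions on a fixed set of unlabeled data (a "stop set") remain unchanged for several consecutive rounds. All names and parameters here (`fit_threshold`, `agree_rounds`, the 0.6 decision boundary) are invented for this example.

```python
def fit_threshold(labeled):
    # Toy 1-D classifier: threshold at the midpoint of the class means.
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if not pos or not neg:          # degenerate labeled set
        return 0.5
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def active_learn(pool, oracle, stop_set, agree_rounds=3):
    """Query by uncertainty; stop when predictions on `stop_set`
    (unlabeled data) are unchanged for `agree_rounds` rounds."""
    unlabeled = sorted(pool)
    # Seed the labeled set with the two extremes so that, ideally,
    # both classes are represented from the start.
    labeled = [(x, oracle(x)) for x in (unlabeled.pop(0), unlabeled.pop(-1))]
    t = fit_threshold(labeled)
    prev, streak = None, 0
    while unlabeled:
        preds = tuple(1 if x >= t else 0 for x in stop_set)
        streak = streak + 1 if preds == prev else 0
        if streak >= agree_rounds:
            break                   # predictions have stabilized
        prev = preds
        # Uncertainty sampling: label the point closest to the boundary.
        q = min(unlabeled, key=lambda u: abs(u - t))
        unlabeled.remove(q)
        labeled.append((q, oracle(q)))
        t = fit_threshold(labeled)
    return t, len(labeled)

pool = [i / 20 for i in range(21)]        # 0.00, 0.05, ..., 1.00
oracle = lambda x: 1 if x >= 0.6 else 0   # true boundary at 0.6
t, n_labeled = active_learn(pool, oracle, stop_set=pool)
```

On this toy pool the learner stops after labeling only part of the data, with a learned threshold that classifies every pool point the same way the oracle does. The key point the sketch makes is that the stopping decision consumes only unlabeled data: no extra labeled validation set is needed, which is exactly what distinguishes the unlabeled-data-based stopping methods the abstract refers to.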
Jan-25-2019