Training Subset Selection for Weak Supervision
Hunter Lang, Aravindan Vijayaraghavan, David Sontag
– arXiv.org Artificial Intelligence
Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug into existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
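To make the idea concrete, here is a minimal sketch of what such a selection step could look like, assuming precomputed pretrained embeddings and integer-coded weak labels. The function name `cut_statistic_select`, the uniform edge weights, and the plain kNN graph are illustrative assumptions, not the paper's exact construction (the paper may, for example, weight edges by similarity):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def cut_statistic_select(embeddings, weak_labels, k=20, coverage=0.5):
    """Rank weakly-labeled examples by a cut statistic over a kNN graph
    in representation space, keeping the `coverage` fraction whose
    neighborhoods are most label-homogeneous (a sketch, not the paper's
    exact procedure)."""
    weak_labels = np.asarray(weak_labels)  # assumed integer class ids
    n = len(weak_labels)

    # kNN graph in the pretrained representation space
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    idx = idx[:, 1:]  # drop the self-neighbor

    # J_i: number of "cut" edges, i.e. neighbors with a different weak label
    # (uniform edge weights for simplicity)
    neighbor_labels = weak_labels[idx]                       # shape (n, k)
    cut = (neighbor_labels != weak_labels[:, None]).sum(axis=1)

    # Null model: neighbor labels drawn i.i.d. from the class frequencies.
    # Standardize J_i against that model; very negative z means the
    # neighborhood agrees with the example's weak label far more than chance.
    class_freq = np.bincount(weak_labels) / n
    p = class_freq[weak_labels]          # P(random neighbor shares the label)
    mu = k * (1.0 - p)
    sigma = np.maximum(np.sqrt(k * p * (1.0 - p)), 1e-8)
    z = (cut - mu) / sigma

    # Keep the examples with the lowest z-scores
    return np.argsort(z)[: int(coverage * n)]
```

In use, one would train the downstream classifier only on the selected rows, e.g. `keep = cut_statistic_select(X_emb, y_weak); clf.fit(X_emb[keep], y_weak[keep])`, which matches the paper's claim that the method plugs into an existing pipeline with a few lines of code.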
March 6, 2023