Data labeling for AI research is highly inconsistent, study finds

Supervised machine learning, in which machine learning models learn from labeled training data, is only as good as the quality of that data. In a study published in the journal Quantitative Science Studies, researchers at the consultancy Webster Pacific and the University of California's San Diego and Berkeley campuses investigate the extent to which AI research papers follow best practices for data labeling, focusing on human-labeled data. They found that the types of labeled data vary widely from paper to paper, and that a "plurality" of the studies they surveyed gave no information about who performed the labeling or where the data came from.

While labeled data is usually equated with ground truth, datasets can, and do, contain errors. The processes used to build them are inherently error-prone, which becomes problematic when those errors reach test sets, the subsets of datasets that researchers use to compare progress. A recent MIT paper identified thousands to millions of mislabeled samples in datasets used to train commercial systems.
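One common way to surface such label errors is confident learning, the family of techniques behind the MIT work: train a model, score each example's assigned label against the model's out-of-sample predicted probabilities, and flag examples whose own label the model finds implausible. The sketch below is a minimal, illustrative numpy version of that idea; the function name, the per-class threshold rule, and the toy data are assumptions made for this example, not the paper's exact method.

```python
import numpy as np

def flag_suspect_labels(labels, pred_probs):
    """Flag examples whose assigned label the model finds implausible.

    labels:     (n,) integer array of assigned class labels
    pred_probs: (n, k) array of out-of-sample predicted class probabilities
    Returns a boolean mask: True where the label looks suspect.
    Assumes every class has at least one labeled example.
    """
    n, k = pred_probs.shape
    # Per-class confidence threshold: the mean predicted probability of
    # class j over examples actually labeled j (a "self-confidence" baseline).
    thresholds = np.array([
        pred_probs[labels == j, j].mean() for j in range(k)
    ])
    # An example is suspect if the model confidently places it in some class
    # that clears its threshold, while its own label falls below its
    # class's threshold.
    self_conf = pred_probs[np.arange(n), labels]
    exceeds_some_class = (pred_probs >= thresholds).any(axis=1)
    return exceeds_some_class & (self_conf < thresholds[labels])

# Toy example: 5 samples, 3 classes; sample 2's label disagrees with the model.
probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.90, 0.05],  # labeled class 0, but the model says class 1
    [0.70, 0.20, 0.10],
    [0.20, 0.10, 0.70],
])
labels = np.array([0, 1, 0, 0, 2])
print(flag_suspect_labels(labels, probs))  # [False False  True False False]
```

In practice, `pred_probs` should come from cross-validation so the model never scores examples it was trained on, and flagged examples are candidates for human review rather than automatic relabeling.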
