The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That's because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they're able to make predictions. But while labeled data is usually equated with ground truth, datasets can -- and do -- contain errors. The processes used to construct corpora often involve some degree of automatic annotation or crowdsourcing techniques that are inherently error-prone.
Apr-2-2021, 21:45:10 GMT