PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods

#artificialintelligence 

To our knowledge, PMLB represents the largest publicly available collection of curated, ready-to-use ML benchmark datasets for classification and regression in existence. Competing ML dataset collections--such as the UCI Machine Learning Repository (Dua and Graff, 2017) or Kaggle Datasets--tend to contain a mixture of classification, regression and other datasets, with varying degrees of documentation/preprocessing and often inadequately characterized measures of data quality. Other, smaller collections of datasets--like Scikit-Learn's datasets module (Pedregosa et al., 2011)--can be well-documented and curated, but lack the breadth and scope of PMLB. PMLB aims to balance this tradeoff, a task which we approach through a combination of crowdsourcing datasets, automating the assessment of data quality, and utilizing appropriate third-party tools, such as GitHub's continuous integration features, Pandas Profiling and Git Large File Store, as described in the following text. PMLB consists of three main components: (i) the collection of benchmark datasets, including metadata and associated documentation, (ii) a Python interface for easily accessing the datasets in the PMLB collection and (iii) an R interface providing similar functionality to the Python interface.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found