PMLB v1.0: an open-source dataset collection for benchmarking machine learning methods

Aug-20-2022, 11:40:09 GMT–#artificialintelligence

To our knowledge, PMLB is the largest publicly available collection of curated, ready-to-use ML benchmark datasets for classification and regression. Competing ML dataset collections, such as the UCI Machine Learning Repository (Dua and Graff, 2017) or Kaggle Datasets, tend to mix classification, regression and other datasets, with varying degrees of documentation and preprocessing and often inadequately characterized measures of data quality. Other, smaller collections, like Scikit-Learn's datasets module (Pedregosa et al., 2011), can be well documented and curated, but lack the breadth and scope of PMLB. PMLB aims to balance this tradeoff between breadth and curation, a task which we approach through a combination of crowdsourcing datasets, automating the assessment of data quality, and using appropriate third-party tools, such as GitHub's continuous integration features, Pandas Profiling and Git Large File Storage, as described in the following text. PMLB consists of three main components: (i) the collection of benchmark datasets, including metadata and associated documentation, (ii) a Python interface for easily accessing the datasets in the PMLB collection and (iii) an R interface providing similar functionality to the Python interface.
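As a rough illustration of what "ready-to-use" means for component (ii), the sketch below separates features from labels in a PMLB-style table. It assumes, without the excerpt confirming it, that each dataset is a gzip-compressed, tab-separated file whose label column is named `target`. The helper names `split_features_target` and `fetch_with_pmlb` are hypothetical, and the demo table is made up. The second helper relies on the `fetch_data` function of the `pmlb` Python package and is not called here because it needs the package installed and network access.

```python
import csv
import gzip
import io


def split_features_target(tsv_gz_bytes, target_column="target"):
    """Parse a gzip-compressed TSV table and separate the label column.

    Assumes every column holds numeric values and that the label column
    is called `target_column` (an assumption about the file layout).
    """
    text = gzip.decompress(tsv_gz_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    feature_names = [name for name in reader.fieldnames if name != target_column]
    X, y = [], []
    for row in reader:
        X.append([float(row[name]) for name in feature_names])
        y.append(float(row[target_column]))
    return feature_names, X, y


def fetch_with_pmlb(dataset_name):
    """Download one dataset through the pmlb package (not called below)."""
    from pmlb import fetch_data  # imported lazily so the demo runs without pmlb
    return fetch_data(dataset_name, return_X_y=True)


# Tiny made-up table for illustration only; this is not a real PMLB dataset.
demo = gzip.compress(b"a\tb\ttarget\n1.0\t2.0\t0\n3.5\t4.5\t1\n")
names, X, y = split_features_target(demo)
print(names, X, y)
```

With the package installed, a call such as `fetch_with_pmlb("mushroom")` would be expected to return a feature array and a label array for that dataset, although the dataset name is given only as an example.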


