The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track
–Neural Information Processing Systems
Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly recognizing the importance of data curation to the advancement of both applications and the fundamental understanding of machine learning models, as evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of recent dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a thorough literature review of data curation principles. We use the framework to systematically assess the strengths and weaknesses of current dataset development practices across 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021 to 2023.
Mar-21-2025, 12:10:17 GMT
- Country:
- Europe (0.93)
- North America
- Canada > Ontario
- Toronto (0.14)
- United States > Massachusetts
- Middlesex County > Cambridge (0.14)
- Genre:
- Overview (0.88)
- Research Report
- Experimental Study (1.00)
- New Finding (1.00)
- Industry:
- Energy (0.46)
- Health & Medicine (0.46)
- Information Technology (0.46)
- Law (0.67)
- Technology: