Hileman, Ryan
DMLR: Data-centric Machine Learning Research -- Past, Present and Future
Oala, Luis, Maskey, Manil, Bat-Leah, Lilith, Parrish, Alicia, Gürel, Nezihe Merve, Kuo, Tzu-Sheng, Liu, Yang, Dror, Rotem, Brajovic, Danilo, Yao, Xiaozhe, Bartolo, Max, Rojas, William A Gaviria, Hileman, Ryan, Aliment, Rainier, Mahoney, Michael W., Risdal, Meg, Lease, Matthew, Samek, Wojciech, Dutta, Debojyoti, Northcutt, Curtis G, Coleman, Cody, Hancock, Braden, Koch, Bernard, Tadesse, Girmaw Abebe, Karlaš, Bojan, Alaa, Ahmed, Dieng, Adji Bousso, Noy, Natasha, Reddi, Vijay Janapa, Zou, James, Paritosh, Praveen, van der Schaar, Mihaela, Bollacker, Kurt, Aroyo, Lora, Zhang, Ce, Vanschoren, Joaquin, Guyon, Isabelle, Mattson, Peter
Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and at earlier meetings, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
Speech Wikimedia: A 77 Language Multilingual Speech Dataset
Gómez, Rafael Mosquera, Eusse, Julián, Ciro, Juan, Galvez, Daniel, Hileman, Ryan, Bollacker, Kurt, Kanter, David
The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA licensed transcribed speech from a diverse set of scenarios and speakers, in 77 different languages. Each audio file has one or more transcriptions in different languages, making this dataset suitable for training speech recognition, speech translation, and machine translation models.