More public data key to democratizing ML, says MLCommons
Unless you're an English speaker, and one with as neutral an American accent as possible, you've probably butted heads with a digital assistant that couldn't understand you. With any luck, a couple of open-source datasets from MLCommons could help future systems grok your voice. The two datasets, which were made generally available in December, are the People's Speech Dataset (PSD), a 30,000-hour database of spontaneous English speech, and the Multilingual Spoken Words Corpus (MSWC), a dataset of some 340,000 keywords in 50 languages.

By making both datasets publicly available under CC-BY and CC-BY-SA licenses, MLCommons hopes to democratize machine learning (that is to say, make it available to everyone) and help push the industry toward data-centric AI. David Kanter, executive director and founder of MLCommons, told Nvidia in a podcast this week that he sees data-centric AI as a conceptual pivot from "which model is the most accurate" to "what can we do with data to improve model accuracy."
Apr-19-2022, 14:31:16 GMT