Making it easier to discover datasets

#artificialintelligence

Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they're hosted, whether it's a publisher's site, a digital library, or an author's personal web page. To create Dataset Search, we developed guidelines for dataset providers to describe their data in a way that lets Google (and other search engines) better understand the content of their pages. These guidelines include salient information about datasets: who created the dataset, when it was published, how the data was collected, what the terms are for using the data, etc. We then collect and link this information, analyze where different versions of the same dataset might be, and find publications that may be describing or discussing the dataset. Our approach is based on an open standard for describing this information (schema.org).
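To make the kind of description mentioned above concrete, here is a minimal sketch of schema.org `Dataset` markup built as JSON-LD in Python. The property names (`creator`, `datePublished`, `license`, `measurementTechnique`) come from the schema.org vocabulary; the helper function `dataset_jsonld` and every value passed to it are hypothetical and only for illustration, not taken from the guidelines themselves.

```python
import json


def dataset_jsonld(name, description, creator, date_published,
                   license_url, measurement_technique):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper: it only assembles the fields named in the text
    (who created the data, when, how it was collected, terms of use).
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "creator": {"@type": "Person", "name": creator},
        "datePublished": date_published,
        "license": license_url,
        "measurementTechnique": measurement_technique,
    }


# All values below are made up for the example.
example = dataset_jsonld(
    name="Example rainfall measurements",
    description="Daily rainfall totals from a hypothetical sensor network.",
    creator="Jane Doe",
    date_published="2018-09-05",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    measurement_technique="Tipping-bucket rain gauge",
)

# Serialize to the JSON text a provider could embed in a page.
markup = json.dumps(example, indent=2)
print(markup)
```

A provider would typically place the printed JSON inside a `<script type="application/ld+json">` element on the dataset's landing page so that crawlers can read it.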


What are open datasets? Curated public datasets - Azure Open Datasets (preview)

#artificialintelligence

Data scientists often spend the majority of their time cleaning and preparing data for advanced analytics. Open Datasets are copied to the Azure cloud and preprocessed to save you time. At regular intervals data is pulled from the sources, such as by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA), parsed into a structured format, and then enriched as appropriate with features such as ZIP Code or location of the nearest weather station.
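The parse-and-enrich steps described above can be sketched roughly as follows. This is not how Azure Open Datasets is implemented; it is an illustrative sketch that assumes a simple `date,lat,lon,temp_c` text format and a small hypothetical station list, and it uses the haversine formula to pick the nearest station.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def parse_reading(line):
    """Parse one raw 'date,lat,lon,temp_c' line into a structured record.

    The line format is an assumption made for this example.
    """
    date, lat, lon, temp_c = line.strip().split(",")
    return {"date": date, "lat": float(lat), "lon": float(lon),
            "temp_c": float(temp_c)}


def enrich_with_nearest_station(record, stations):
    """Return a copy of the record with the closest station's id added."""
    nearest = min(
        stations,
        key=lambda s: haversine_km(record["lat"], record["lon"],
                                   s["lat"], s["lon"]),
    )
    return {**record, "nearest_station": nearest["id"]}


# Hypothetical stations and raw input, invented for the example.
STATIONS = [
    {"id": "STATION_A", "lat": 47.45, "lon": -122.31},
    {"id": "STATION_B", "lat": 40.78, "lon": -73.97},
]
raw_line = "2019-05-01,47.61,-122.33,12.5"

enriched = enrich_with_nearest_station(parse_reading(raw_line), STATIONS)
print(enriched)
```

In a real pipeline the raw lines would come from the source on a schedule (for example over FTP), and the linear scan over stations would likely be replaced by a spatial index once the station list grows large.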


20 Weird & Wonderful Datasets for Machine Learning

#artificialintelligence

They say great data is 95% of the problem in machine learning. We saw firsthand at Udacity that this is the case, with the amazing reception from the machine learning community when we open sourced over 250GB of driving data. But finding interesting data is really hard, and actively holds the industry back from progress. In trying to learn more about this problem I searched far and wide, and cataloged just a sliver of the datasets I found. In the hope that others might find this catalog useful, here are 20 weird and wonderful datasets you could (perhaps) use in machine learning. I've also been fascinated with the Militarized Interstate Disputes dataset, which includes 200 years of international threats and conflicts.


[R] Unbiased Look at Dataset Bias • r/MachineLearning

@machinelearnbot

I found this paper worth sharing, and actually quite disturbing: it seems that no matter how "diversified" you think your datasets' classes are, it is still possible to tell the different datasets apart from one another.