Making it easier to discover datasets


Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they're hosted, whether it's a publisher's site, a digital library, or an author's personal web page. To create Dataset Search, we developed guidelines for dataset providers to describe their data in a way that Google (and other search engines) can better understand the content of their pages. These guidelines cover salient information about datasets: who created the dataset, when it was published, how the data was collected, what the terms are for using the data, and so on. We then collect and link this information, analyze where different versions of the same dataset might be, and find publications that may describe or discuss the dataset. Our approach is based on an open standard for describing this information.
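As an illustration of the kind of markup the guidelines call for, here is a minimal sketch that builds dataset metadata using the schema.org `Dataset` vocabulary (one widely used open standard for this purpose) and serializes it as JSON-LD. All field values below are hypothetical placeholders, not a real dataset.

```python
import json

# Hypothetical dataset metadata following the schema.org "Dataset" type.
# Each property mirrors the information the guidelines mention: creator,
# publication date, collection method, and terms of use.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Weather Observations",
    "description": "Hourly temperature readings from a city sensor network.",
    "creator": {"@type": "Organization", "name": "Example City Open Data"},
    "datePublished": "2020-01-15",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "measurementTechnique": "Automated weather-station sensors",
}

# Serialize to JSON-LD; embedding this in a <script type="application/ld+json">
# tag on the dataset's landing page is how providers expose it to crawlers.
jsonld = json.dumps(dataset_metadata, indent=2)
print(jsonld)
```

A crawler that understands schema.org can then extract these properties directly from the page, which is what makes the dataset discoverable in search.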

Discovering millions of datasets on the web


Based on what we've learned from the early adopters of Dataset Search, we've added new features. You can now filter the results by the types of datasets you want (e.g., tables, images, text), or by whether the dataset is available for free from the provider. If a dataset covers a geographic area, you can view it on a map. Plus, the product is now available on mobile, and we've significantly improved the quality of dataset descriptions. One thing hasn't changed, however: anybody who publishes data can make their datasets discoverable in Dataset Search by using an open standard to describe the properties of their dataset on their own web page.

GitHub's Top Open Datasets For Machine Learning


When working with comprehensive datasets, every data scientist seems to have a favorite go-to. For free resources, Mansi Singhal, CEO of qplum, pointed to one: "In the financial services industry, we find the FRED database incredibly useful," she said. Then there are the datasets that are proprietary. "A good example is stock price data, for which you might need to work with an exchange or one of the third-party providers," she said.

Integrating Locally Learned Causal Structures with Overlapping Variables

Neural Information Processing Systems

In many domains, data are distributed among datasets that share only some variables; other recorded variables may occur in only one dataset. There are several asymptotically correct, informative algorithms that search for causal information given a single dataset, even with missing values and hidden variables. There are, however, no such reliable procedures for distributed data with overlapping variables, and only a single heuristic procedure (Structural EM). This paper describes an asymptotically correct procedure, ION, that provides all the information about structure obtainable from the marginal independence relations. Using simulated and real data, the accuracy of ION is compared with that of Structural EM, and with inference on complete, unified data.

The new Enigma Public – the platform connecting people to data


We're beyond excited to announce today the relaunch of Enigma Public -- the platform connecting people to data. Enigma Public breaks down the barriers to data access and usability, making information created by and for the people discoverable by the people. In a time when grounding observations in truth is more important than ever, our goal is to connect you with the facts and figures that make up the world around you. With Enigma Public you can search, browse, and discover the broadest collection of public data. With our new dataset guides you can dive deeper into the essential datasets, or tap into some of our under-the-radar data to spark a new project idea.