Many applications are facing the problem of learning from an objective dataset, whereas information from other auxiliary sources may be beneficial but cannot be integrated into the objective dataset for learning. In this paper, we propose an omni-view learning approach to enable learning from multiple data collections. The theme is to organize heterogeneous data sources into a unified table with global data view. To achieve the omni-view learning goal, we consider that the objective dataset and the auxiliary datasets share some instance-level dependency structures. We then propose a relational k-means to cluster instances in each auxiliary dataset, such that clusters can help build new features to capture correlations between the objective and auxiliary datasets. Experimental results demonstrate that omni-view learning can help build models which outperform the ones learned from the objective dataset only. Comparisons with the co-training algorithm further assert that omni-view learning provides an alternative, yet effective, way for semi-supervised learning.
Data scientists often spend the majority of their time cleaning and preparing data for advanced analytics. Open Datasets are copied to the Azure cloud and preprocessed to save you time. At regular intervals data is pulled from the sources, such as by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA), parsed into a structured format, and then enriched as appropriate with features such as ZIP Code or location of the nearest weather station.
Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they're hosted, whether it's a publisher's site, a digital library, or an author's personal web page. To create Dataset search, we developed guidelines for dataset providers to describe their data in a way that Google (and other search engines) can better understand the content of their pages. These guidelines include salient information about datasets: who created the dataset, when it was published, how the data was collected, what the terms are for using the data, etc. We then collect and link this information, analyze where different versions of the same dataset might be, and find publications that may be describing or discussing the dataset. Our approach is based on an open standard for describing this information (schema.org)
Note: Google's new dataset search tool was publicly released on January 23rd, 2020. Google recently released datasetsearch, a free tool for searching 25 million publicly available datasets. The search tool includes filters to limit results based on their license (free or paid), format (csv, images, etc), and update time. The results also include descriptions of the dataset's contents as well as author citations. Google's dataset aggregation methodology differs from other dataset repositories like Amazon's open data registry.