
Build NLP Pipelines With HuggingFace Datasets


We'll start by exploring the datasets. As we said, there are a vast number of datasets available, many of them uploaded by the community. Two that I often find myself using are the OSCAR and SQuAD datasets. SQuAD is a brilliant dataset for training Q&A transformer models and is hard to beat for that purpose. It alone is all we need when fine-tuning a transformer model for Q&A.
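As a rough illustration of what a SQuAD record looks like when preparing data for Q&A fine-tuning, here is a minimal sketch. The sample record is made up and only mirrors the field layout (`context`, `question`, `answers` with `text` and `answer_start`) that SQuAD has in the Hugging Face `datasets` library; the helper name `answer_span` is hypothetical. With the library installed, real records would come from `load_dataset("squad")` rather than the hand-written dictionary used here.

```python
# Hypothetical record that mirrors the SQuAD field layout; the text is invented.
# With the `datasets` library installed, real records would come from:
#     from datasets import load_dataset
#     squad = load_dataset("squad")
sample = {
    "id": "example-0",
    "title": "Example",
    "context": "The Hugging Face hub hosts many community datasets.",
    "question": "What does the hub host?",
    "answers": {"text": ["many community datasets"], "answer_start": [27]},
}


def answer_span(record):
    """Return (start, end) character offsets of the first answer in the context.

    Q&A fine-tuning typically needs both ends of the answer span, while the
    record stores only the start offset and the answer text.
    """
    text = record["answers"]["text"][0]
    start = record["answers"]["answer_start"][0]
    end = start + len(text)
    # Guard against records whose offset does not match the context.
    if record["context"][start:end] != text:
        raise ValueError(f"answer offset mismatch in record {record['id']}")
    return start, end


start, end = answer_span(sample)
print(sample["context"][start:end])
```

The offset check is a design choice for this sketch: it fails loudly on misaligned records instead of silently training on the wrong span.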

Introducing Kaggle Datasets


At Kaggle, we want to help the world learn from data. This sounds bold and grandiose, but the biggest barriers to this are incredibly simple. It's tough to access data. It's tough to understand what's in the data once you access it. We want to change this.

Google just published 25 million free datasets


Google recently released Dataset Search, a free tool for searching 25 million publicly available datasets. The search tool includes filters to limit results based on their license (free or paid), format (CSV, images, etc.), and update time. The results also include descriptions of the dataset's contents as well as author citations. Google's dataset aggregation methodology differs from that of other dataset repositories, such as Amazon's open data registry. Unlike repositories that curate and host the datasets themselves, Google neither curates the 25 million datasets nor provides direct access to them.

What are open datasets? Curated public datasets - Azure Open Datasets (preview)


Data scientists often spend the majority of their time cleaning and preparing data for advanced analytics. Open Datasets are copied to the Azure cloud and preprocessed to save you time. At regular intervals data is pulled from the sources, such as by an FTP connection to the National Oceanic and Atmospheric Administration (NOAA), parsed into a structured format, and then enriched as appropriate with features such as ZIP Code or location of the nearest weather station.
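The enrichment step described above can be sketched roughly as follows. This is not Azure's implementation: the station list, coordinates, field names, and the function `enrich_with_nearest_station` are all hypothetical, and the nearest station is assumed to be the one with the smallest great-circle (haversine) distance.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Hypothetical stations; identifiers and coordinates are made up for illustration.
STATIONS = [
    {"station_id": "STN-A", "lat": 47.45, "lon": -122.31},
    {"station_id": "STN-B", "lat": 40.78, "lon": -73.97},
]


def enrich_with_nearest_station(record, stations):
    """Return a copy of `record` with the id of the closest station attached."""
    nearest = min(
        stations,
        key=lambda s: haversine_km(record["lat"], record["lon"], s["lat"], s["lon"]),
    )
    return {**record, "nearest_station": nearest["station_id"]}


# A parsed record, assumed to already carry coordinates after the parsing step.
row = {"reading_id": 1, "lat": 47.60, "lon": -122.33}
print(enrich_with_nearest_station(row, STATIONS))
```

A linear scan over stations is fine for a sketch; a real pipeline with many stations would more likely use a spatial index.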

Making it easier to discover datasets


Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they're hosted, whether it's a publisher's site, a digital library, or an author's personal web page. To create Dataset Search, we developed guidelines for dataset providers to describe their data in a way that Google (and other search engines) can better understand the content of their pages. These guidelines include salient information about datasets: who created the dataset, when it was published, how the data was collected, what the terms are for using the data, etc. We then collect and link this information, analyze where different versions of the same dataset might be, and find publications that may be describing or discussing the dataset. Our approach is based on an open standard for describing this information (