
Google BigQuery Public Datasets


A Social Security Administration dataset that contains all names from Social Security card applications for births that occurred in the United States after 1879. Data collected by the NYC Taxi and Limousine Commission (TLC) that includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015. A dataset that contains all stories and comments from Hacker News since its launch in 2006. A dataset published by the US Department of Health and Human Services that includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013. A dataset that contains 3.5 million digitized books stretching back two centuries, encompassing the complete English-language public domain collections of the Internet Archive (1.3 million volumes) and HathiTrust (2.2 million volumes).

Google announces new AI, smart analytics tools


At the Google Cloud Next conference on Wednesday, Google is rolling out a slew of AI and smart analytics tools. The tools are focused on applying AI to common business challenges such as structuring data from documents or forecasting inventory. First, Google announced AI Platform in beta -- an end-to-end development platform that helps teams collaborate on machine learning projects. It's built for developers, data scientists and data engineers, enabling them to share models and to train and scale workloads from the same dashboard within Cloud Console. Next, Google is rolling out new versions of Cloud AutoML, the software it announced last year that automates the creation of machine learning models.

BigQuery for Data Science


One of the perks of using Google Cloud Platform (GCP) is having BigQuery, Google's cloud-hosted data warehouse solution, at your disposal. BigQuery gives GCP users access to the key features of Dremel, Google's own internal data warehouse solution. Under the hood, Dremel stores data in a columnar format and uses a tree architecture to parallelise queries across thousands of machines, with each query scanning the entire table. So, what is so great about that? With BigQuery you can run SQL queries on a table with billions of rows and get the results in seconds!
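To make the columnar storage and tree architecture a little more concrete, here is a minimal toy sketch in Python. It is not Dremel's or BigQuery's actual implementation: the table contents, the function names (`split_into_shards`, `leaf_sum`, `tree_sum`) and the two-level root/leaf layout are all assumptions made for illustration, and a thread pool stands in for the thousands of machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy column-oriented table: each column is stored as its own list,
# so a query that needs "count" never has to touch "name".
table = {
    "name": ["Ada", "Bob", "Ada", "Cy", "Bob", "Ada"],
    "count": [3, 5, 2, 7, 1, 4],
}

def split_into_shards(columns, num_shards):
    """Split every column into contiguous shards (stand-ins for leaf servers)."""
    n = len(next(iter(columns.values())))
    size = max(1, -(-n // num_shards))  # ceiling division, at least 1 row
    return [
        {col: values[i:i + size] for col, values in columns.items()}
        for i in range(0, n, size)
    ]

def leaf_sum(shard, column):
    """Leaf node: scan only the requested column of its own shard."""
    return sum(shard[column])

def tree_sum(columns, column, num_shards=3):
    """Root node: fan the query out to the leaves, then combine partial sums."""
    shards = split_into_shards(columns, num_shards)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: leaf_sum(shard, column), shards))
    return sum(partials)

print(tree_sum(table, "count"))  # prints 22
```

Every row of the requested column is still scanned, but the work is spread over the shards and only small partial results travel back up the tree, which is the intuition behind getting answers in seconds from very large tables.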

New healthcare and population datasets now available in Google BigQuery | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform


We've just added several publicly available healthcare datasets to the collection of public datasets on Google BigQuery (the cloud-native data warehouse for analytics at petabyte scale), including RxNorm (maintained by NLM) and the Healthcare Common Procedure Coding System (HCPCS) Level II. While it's not technically a healthcare dataset, we also added the 2000 and 2010 Decennial census counts broken down by age, gender and ZIP Code Tabulation Area, which we hope will assist healthcare utilization and population health analysis (as we'll discuss below). Anyone with a Google Cloud Platform (GCP) account can explore these datasets. RxNorm was created by the U.S. National Library of Medicine (NLM) to provide a normalized naming system for clinical drugs and to provide structured information such as brand names, ingredients and so on for each drug. Drug information is made available as a single "concepts" table, while the relationships that map entities to each other (ingredient to brand name, for example) are made available as a separate "relationships" table.
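The split between a "concepts" table and a "relationships" table can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database rather than BigQuery, and the column names (`concept_id`, `name`, `kind`, `source_id`, `target_id`, `relation`), the relation label and the three toy rows are assumptions chosen for illustration; they are not the actual RxNorm schema published in BigQuery.

```python
import sqlite3

# In-memory stand-in for the two tables; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concepts (concept_id INTEGER PRIMARY KEY, name TEXT, kind TEXT);
    CREATE TABLE relationships (source_id INTEGER, target_id INTEGER, relation TEXT);
    INSERT INTO concepts VALUES
        (1, 'ibuprofen', 'ingredient'),
        (2, 'Advil', 'brand_name'),
        (3, 'Motrin', 'brand_name');
    INSERT INTO relationships VALUES
        (1, 2, 'has_brand_name'),
        (1, 3, 'has_brand_name');
""")

def brand_names_for(ingredient):
    """Follow ingredient -> brand name links by joining concepts to itself
    through the relationships table."""
    rows = conn.execute(
        """
        SELECT brand.name
        FROM concepts AS ing
        JOIN relationships AS rel ON rel.source_id = ing.concept_id
        JOIN concepts AS brand ON brand.concept_id = rel.target_id
        WHERE ing.name = ?
          AND ing.kind = 'ingredient'
          AND rel.relation = 'has_brand_name'
        ORDER BY brand.name
        """,
        (ingredient,),
    ).fetchall()
    return [name for (name,) in rows]

print(brand_names_for("ibuprofen"))  # prints ['Advil', 'Motrin']
```

Keeping entities in one table and the links between them in another means a new kind of relationship can be added as rows rather than as new columns, at the cost of a self-join whenever two entities need to be connected.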