Repeatable sampling of data sets in BigQuery for machine learning
Doing machine learning on distributed data sets is methodologically similar to working with data that fits in memory: train your algorithm on one subset of the data, validate on another subset, and finally test on a third. In this post, we'll discuss how to pull data from BigQuery (the no-ops data warehouse that is part of Google Cloud Platform) into machine-learning-ready data sets. We'll use Airline Ontime Performance data, a 70-million-row data set from the U.S. Bureau of Transportation Statistics that is available to all BigQuery users as the airline_ontime_data.flights table.

A simple way to sample is to filter rows with a condition such as RAND() < 0.8. The RAND() function returns a value between 0 and 1, so approximately 80% of the rows in the data set will be selected by such a query. This approach has two problems. First, you want to create three data sets (training, validation, and testing), and while the query gets you 80% of the data, it is not nearly as easy to get the 20% that were not selected, let alone split that remainder into two parts. Second, RAND() returns different values each time it is run, so if you run the query again, you will get a different 80% of the rows.
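The repeatability problem described above can be illustrated outside BigQuery with a small Python sketch. It is not the query from this post: the rows are hypothetical stand-ins for flight records, `random_split` only mimics the behavior of a random-number filter, and `hash_split` shows one common alternative (bucketing on a hash of a column) under the assumption that MD5 over the date string is an acceptable hash. In BigQuery itself the analogous filter would rely on a SQL hash function such as FARM_FINGERPRINT rather than Python code.

```python
import hashlib
import random


def random_split(rows, fraction=0.8, seed=None):
    """Mimic a random-number filter: each row is kept independently.

    With seed=None, every call selects a different subset, and the rows
    that were NOT selected are not recorded anywhere.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


def hash_bucket(key, buckets=10):
    """Map a key to a stable bucket in [0, buckets).

    MD5 is an illustrative choice (an assumption of this sketch);
    any deterministic hash would serve the same purpose.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def hash_split(rows, key_fn):
    """Assign rows to train (buckets 0-7), validation (8), or test (9)."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        bucket = hash_bucket(key_fn(row))
        if bucket < 8:
            splits["train"].append(row)
        elif bucket == 8:
            splits["validation"].append(row)
        else:
            splits["test"].append(row)
    return splits


# Hypothetical stand-in for flight rows: (date, airline) pairs.
rows = [(f"2008-01-{day:02d}", airline)
        for day in range(1, 29)
        for airline in ("AA", "DL", "UA")]

# Unseeded random sampling: the selected subset changes from run to run.
sample = random_split(rows)

# Hash-based sampling on the date column: the same rows land in the same
# split on every run, and the three splits never overlap.
splits = hash_split(rows, key_fn=lambda row: row[0])
```

Because the bucket depends only on the key, all rows that share a date end up in the same split. The split sizes are only approximately 80/10/10, and how close they come depends on how many distinct key values there are and how evenly the rows are spread across them.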
Nov-14-2016, 13:05:31 GMT