Using Distributed Machine Learning to Model Big Data Efficiently
To use Spark, we can either run it on an AWS EMR cluster or, if you just want to try it out and play with it, run it on a local Jupyter notebook. There are many great articles on how to set up a notebook on AWS EMR to use PySpark, such as this one. The EMR cluster configuration will also largely affect your runtime, which I will discuss in the last part. For preprocessing the data, I will be using Spark RDD manipulation to perform exploratory data analysis and visualization. The rest of the Spark preprocessing code and the Plotly visualization code can be found in the GitHub repo, but here are the graphs from our initial exploratory analysis.
[Interactive Plotly charts: breakdown of the data by Country (location hierarchy, e.g. North America > United States > California > San Francisco County > San Francisco) and by Technology]