I have a dataset of roughly 800 million records covering 8 million unique members. Only about 25,000 of those members are positive examples. A random forest (RF) trained on downsampled data performs pretty well, but I would like to do something other than simply downsample. The data is on a Hadoop cluster, and I can only access it through a Zeppelin notebook with PySpark.
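
To put the imbalance in numbers, here is a minimal sketch in plain Python using the member-level counts quoted above. It computes the positive rate and the usual inverse-frequency ("balanced") class weights, one common alternative to downsampling. The function name `balanced_class_weights` is hypothetical, and treating the label as one per member (rather than one per record) is an assumption.

```python
# Counts taken from the description above, assumed to be at the member level.
TOTAL_MEMBERS = 8_000_000
POSITIVE_MEMBERS = 25_000


def balanced_class_weights(n_positive, n_total):
    """Inverse-frequency weights: n_total / (n_classes * class_count).

    Hypothetical helper for illustration only; label 1 is the positive class.
    """
    n_negative = n_total - n_positive
    n_classes = 2
    return {
        1: n_total / (n_classes * n_positive),
        0: n_total / (n_classes * n_negative),
    }


positive_rate = POSITIVE_MEMBERS / TOTAL_MEMBERS
weights = balanced_class_weights(POSITIVE_MEMBERS, TOTAL_MEMBERS)

print(f"positive rate: {positive_rate:.4%}")
print(f"class weights: {weights}")
```

Weights like these could be attached to each row as an extra column and passed to a Spark ML estimator that accepts a `weightCol` parameter. Whether the random forest in the Spark version on this cluster supports that is not something the post states, so it would need checking.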
Oct-6-2019, 14:28:18 GMT