r/MachineLearning - [D] How to deal with a classification problem on a big imbalanced dataset?


I have a dataset of approximately 800 million records covering 8 million unique members. Of those 8 million members, only about 25,000 are positive. I would rather not simply downsample, even though a random forest (RF) trained on the downsampled data performs pretty well. The data is on a Hadoop cluster, and I can only access it through a Zeppelin notebook with PySpark.
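To put numbers on the imbalance, here is a rough sketch of the arithmetic, along with the "balanced" class weights one might use instead of downsampling. It assumes one label per member (so 8 million labeled rows after aggregating the 800 million records), which the post implies but does not state. The weighting formula is the common `n_total / (n_classes * n_class)` heuristic, not something from the original setup.

```python
# Counts taken from the post; assumes one label per member (an assumption).
n_members = 8_000_000
n_positive = 25_000
n_negative = n_members - n_positive

# Fraction of members that are positive.
positive_rate = n_positive / n_members

# "Balanced" weighting heuristic: weight_c = n_total / (n_classes * n_c),
# so both classes contribute the same total weight to the loss.
n_classes = 2
w_pos = n_members / (n_classes * n_positive)
w_neg = n_members / (n_classes * n_negative)

print(f"positive rate: {positive_rate:.4%}")
print(f"negatives per positive: {n_negative / n_positive:.0f}")
print(f"w_pos={w_pos:.1f}, w_neg={w_neg:.4f}")
```

In PySpark, weights like these could be attached as an extra column and passed to an estimator that accepts a `weightCol` parameter, such as `pyspark.ml.classification.LogisticRegression`. Whether the RF implementation on a given cluster supports `weightCol` depends on the Spark version, so that would need checking.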