What are effective preprocessing methods for reducing data set size (e.g., removing records) without losing information for machine learning problems?
Sometimes the simplest methods are best... Random sampling is easy to understand, hard to screw up, and unlikely to introduce bias into your process. Building a training pipeline on a random sample (drawn without replacement) of your dataset is a good way to iterate faster. Once you have a pipeline you're satisfied with, you can rerun it over the entire dataset to estimate how much performance the extra data actually buys you. If your training pipeline is robust, your results should not change much: performance may rise as you add more data, but it will tend to do so slowly. The basic intuition here is that the strongest signals in your data will show up even in relatively small samples, almost by definition (if they didn't, they wouldn't be strong!).
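As a rough sketch of what this looks like in Python, assuming the data fits in a pandas DataFrame (the file name, sampling fraction, and `fit_pipeline` function below are placeholders, not part of the original answer):

```python
import pandas as pd

# Placeholder data source; swap in your own loading code.
df = pd.read_csv("training_data.csv")

# Draw a 10% random sample without replacement. The fraction is arbitrary,
# and a fixed random_state keeps the sample reproducible across runs.
sample = df.sample(frac=0.10, replace=False, random_state=42)

# Develop and tune the pipeline on the sample...
# fit_pipeline(sample)   # hypothetical pipeline function

# ...then rerun the finished pipeline on the full dataset to measure
# how much (if anything) the extra data improves your results.
# fit_pipeline(df)
```

If your data is highly imbalanced, you may want to sample within each class instead of globally, but for most problems a plain uniform sample is enough to surface the dominant signals.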
Mar-21-2016, 00:29:20 GMT