What are effective preprocessing methods for reducing data set size (e.g., removing records) without losing information for machine learning problems?
Sometimes the simplest methods are best... Random sampling is easy to understand, hard to screw up, and unlikely to introduce bias into your process. Building a training pipeline on a random sample (drawn without replacement) of your dataset is a good way to iterate faster. Once you have a pipeline you're satisfied with, you can rerun it over the entire dataset to estimate how much performance the extra data actually buys you. If your training pipeline is robust, your results should not change much: performance may rise as you add more data, but it will tend to do so slowly. The basic intuition here is that the strongest signals in your data will show up even in relatively small samples, almost by definition (if they didn't, they wouldn't be strong!).
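As a rough sketch of what this looks like in Python, assuming the data fits in a pandas DataFrame (the file name, sampling fraction, and `fit_pipeline` function below are placeholders, not part of the original answer):

```python
import pandas as pd

# Placeholder data source; swap in your own loading code.
df = pd.read_csv("training_data.csv")

# Draw a 10% random sample without replacement. The fraction is arbitrary,
# and a fixed random_state keeps the sample reproducible across runs.
sample = df.sample(frac=0.10, replace=False, random_state=42)

# Develop and tune the pipeline on the sample...
# fit_pipeline(sample)   # hypothetical pipeline function

# ...then rerun the finished pipeline on the full dataset to measure
# how much (if anything) the extra data improves your results.
# fit_pipeline(df)
```

If your data is highly imbalanced, you may want to sample within each class instead of globally, but for most problems a plain uniform sample is enough to surface the dominant signals.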
Mar-21-2016, 00:29:20 GMT