How to correctly select a sample from a huge dataset in machine learning