Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning

Neural Information Processing Systems 

If we naively run a grid search with 1000 training runs for hyper-parameter tuning (not an uncommon number today), the search will take 4000 GPU hours, which works out to about 4 GPU hours per run.
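
The arithmetic behind this figure can be sketched in a few lines. This is only an illustration: the grid below (three hyper-parameters with ten values each) and the helper name `grid_search_gpu_hours` are hypothetical, and the 4 GPU hours per run is inferred from the stated totals rather than given directly in the text.

```python
from itertools import product


def grid_search_gpu_hours(grid, gpu_hours_per_run):
    """Return (number of runs, total GPU hours) for an exhaustive grid search.

    Every combination of values in `grid` gets one full training run.
    """
    num_runs = len(list(product(*grid.values())))
    return num_runs, num_runs * gpu_hours_per_run


# Hypothetical grid: 10 values for each of 3 hyper-parameters -> 1000 configurations.
grid = {
    "learning_rate": [10 ** (-i / 3) for i in range(10)],
    "weight_decay": [10 ** (-i) for i in range(10)],
    "batch_size": [2 ** i for i in range(4, 14)],
}

# Assumed cost per run, inferred as 4000 GPU hours / 1000 runs.
num_runs, total_hours = grid_search_gpu_hours(grid, gpu_hours_per_run=4.0)
print(num_runs, total_hours)
```

Because the cost is the product of the grid sizes times the per-run cost, adding a fourth hyper-parameter with ten values would multiply the total by ten, which is why reducing the cost of each individual run matters.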