Noob question: why should we normalize test data with mean and std from training data? • /r/MachineLearning

#artificialintelligence

Nah. It's only really needed for things like neural networks, where it keeps the features in a range where gradient descent works best, and for linear/logistic regression, where it isn't strictly required either but makes the weights comparable as a rough measure of each feature's contribution to the prediction. Things like Random Forest, which are based on decision trees, will find a split anywhere, so it doesn't matter how the features are scaled. For stuff like nearest neighbours, it can be important, or it can hurt. That's because normalisation amounts to saying all features are equally important, which isn't necessarily true. Say you've got spatial information in a rectangular space: normalising each axis stretches the short axis of that rectangle until it counts as much as the long one, which distorts the real distances.
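To make the rectangle point concrete, here's a rough sketch in plain Python. The four points and the query are made up for illustration (a 100 x 1 rectangle), and `nearest` and `standardize` are hypothetical helper names, not from any library. It just shows that the nearest neighbour can flip once each axis is standardized.

```python
from math import dist
from statistics import mean, pstdev

# Made-up spatial points in a long, thin rectangle (x spans 0..100, y spans 0..1).
points = [(0.0, 0.0), (100.0, 1.0), (52.0, 0.5), (50.0, 1.0)]
query = (50.0, 0.5)

def nearest(pts, q):
    """Index of the point in pts closest to q (Euclidean distance)."""
    return min(range(len(pts)), key=lambda i: dist(pts[i], q))

# Per-axis mean and (population) standard deviation of the points.
columns = list(zip(*points))
mus = [mean(c) for c in columns]
sigmas = [pstdev(c) for c in columns]

def standardize(p):
    """Rescale each coordinate to zero mean and unit variance."""
    return tuple((v - m) / s for v, m, s in zip(p, mus, sigmas))

# Nearest neighbour in the original units.
raw_nn = points[nearest(points, query)]

# Nearest neighbour after standardizing every axis.
scaled_points = [standardize(p) for p in points]
scaled_nn = points[nearest(scaled_points, standardize(query))]

print("raw nearest:   ", raw_nn)
print("scaled nearest:", scaled_nn)
```

In the raw units the closest point is the one 0.5 away along the short axis. After standardizing, that 0.5 gets blown up relative to the long axis and the point 2 units away along x wins instead. Which answer is "right" depends on whether the original units mean something, which for spatial data they usually do.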


@machinelearnbot

Well, since both sets are samples from the same distribution, they should ideally have similar means and variances. They obviously won't be identical though, and in that case it makes sense to use the means and variances from the training data, since that's what the model was trained on. The model has learned a mapping from inputs standardized with the training data's mean and variance, so standardizing the test data with its own mean and variance would feed the model inputs on a slightly different scale and give you inaccurate results.
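Here's a minimal sketch of that in plain Python, with toy numbers I made up. `fit_scaler` and `transform` are hypothetical helper names, not a library API. The stats are computed once on the training data and then reused on the test data.

```python
from statistics import mean, pstdev

def fit_scaler(train):
    """Return (mean, std) computed from the training values only."""
    return mean(train), pstdev(train)

def transform(values, mu, sigma):
    """Standardize values using the supplied mean and std."""
    return [(v - mu) / sigma for v in values]

# Toy data: the test values sit above everything seen in training.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 22.0]

# What you should do: reuse the training statistics on the test set.
mu, sigma = fit_scaler(train)
test_with_train_stats = transform(test, mu, sigma)

# What you should not do: refit the statistics on the test set itself.
test_with_own_stats = transform(test, *fit_scaler(test))

print("train stats:", test_with_train_stats)
print("own stats:  ", test_with_own_stats)
```

With the training stats, both test values come out more than two standard deviations above the training mean, which is what the model should see. With the test set's own stats they get squashed to -1 and 1, and the fact that they're unusually large is lost.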