Training and Evaluating with Human Label Variation: An Empirical Study