Towards Reliable Domain Generalization: A New Dataset and Evaluations