Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models