Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes