Rethinking Generative Large Language Model Evaluation for Semantic Comprehension