Causal Evaluation of Language Models