Unbiased Evaluation of Large Language Models from a Causal Perspective