A Closer Look into Automatic Evaluation Using Large Language Models