Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation