A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators