An Empirical Analysis on Large Language Models in Debate Evaluation