Evaluating the Performance of Large Language Models via Debates