Evaluating the Performance of Large Language Models via Debates
