Large Language Models are Inconsistent and Biased Evaluators