LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks