NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
