NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark