Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
