On Leakage of Code Generation Evaluation Datasets
Matton, Alexandre, Sherborne, Tom, Aumiller, Dennis, Tommasone, Elena, Alizadeh, Milad, He, Jingyi, Ma, Raymond, Voisin, Maxime, Gilsenan-McMahon, Ellen, Gallé, Matthias
A second possibility is that contamination happens indirectly through the use of synthetic data--a widespread paradigm used in particular to increase code capabilities by generating additional code training tokens. Finally, we argue that final model selection might have been overly influenced by their performance on these datasets, overfitting to performance on these metrics over general-purpose code-oriented skills.

Code generation has emerged as an important skill for large language models to master. Measuring recent progress in code generation has relied on few, critical benchmarks to judge performance between model families and checkpoints. While many recent sophisticated evaluation datasets have been proposed (Jain et al., 2024; Jimenez et al., 2024),
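To make the notion of contamination concrete, a common way to test whether a benchmark has leaked into a training corpus is to measure n-gram overlap between benchmark solutions and training documents. The sketch below is purely illustrative and not the paper's method; the function names (`token_ngrams`, `contamination_score`) and the threshold choice of 8-gram matching are assumptions for the example.

```python
def token_ngrams(tokens, n):
    """Return the set of all contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(train_docs, solution, n=8):
    """Fraction of the solution's n-grams that also occur anywhere in
    the training documents. 1.0 means every n-gram is present (likely
    verbatim leakage); 0.0 means no overlap at this n-gram size.

    Illustrative sketch: real contamination checks typically normalize
    whitespace/casing and use suffix structures for scale.
    """
    solution_ngrams = token_ngrams(solution.split(), n)
    if not solution_ngrams:
        return 0.0
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= token_ngrams(doc.split(), n)
    return len(solution_ngrams & train_ngrams) / len(solution_ngrams)
```

For instance, a benchmark solution copied verbatim into a synthetic-data pipeline would score 1.0, while an unrelated document scores 0.0; intermediate scores suggest partial or paraphrased overlap.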