Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models