Re-Evaluating Code LLM Benchmarks Under Semantic Mutation