Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark