Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
