Stress Testing Generalization: How Minor Modifications Undermine Large Language Model Performance