A Benchmark Suite for Systematically Evaluating Reasoning Shortcuts
