Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models
