Reasoning or Simply Next Token Prediction? A Benchmark for Stress-Testing Large Language Models