BeyondBench: Benchmark-Free Evaluation of Reasoning in Language Models