Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?