VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
