Testing Neural Network Verifiers: A Soundness Benchmark with Hidden Counterexamples