Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis