Benchmarking Large Language Models via Random Variables