Toward a unified framework for data-efficient evaluation of large language models