Towards the Scalable Evaluation of Cooperativeness in Language Models