Lessons from the Trenches on Reproducible Evaluation of Language Models
