Catwalk: A Unified Language Model Evaluation Framework for Many Datasets