Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark