Benchmarking Large Language Model Capabilities for Conditional Generation