Evaluating Large Language Models on Controlled Generation Tasks