Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation