Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation