Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think