MTA: Multimodal Task Alignment for BEV Perception and Captioning