MTA: Multimodal Task Alignment for BEV Perception and Captioning
