Exploring Fusion Strategies for Multimodal Vision-Language Systems