Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning