Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion