Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models