Evaluating Multimodal Large Language Models on Core Music Perception Tasks