Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment