Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models