FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion