A novel multimodal dynamic fusion network for disfluency detection in spoken utterances