Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification