Multimodal Learning and Reasoning for Visual Question Answering