Multimodal Commonsense Knowledge Distillation for Visual Question Answering