Detection-based Intermediate Supervision for Visual Question Answering