Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering