A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues