What's in the Box? Reasoning about Unseen Objects from Multimodal Cues