Learning Human-Human Interactions in Images from Weak Textual Supervision