What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs