Learning to Localize Sound Source in Visual Scenes