Leveraging Visual Supervision for Array-based Active Speaker Detection and Localization