STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking