Learning Joint Statistical Models for Audio-Visual Fusion and Segregation