Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation