Learning Spatial-Semantic Features for Robust Video Object Segmentation