Learning Local and Global Temporal Contexts for Video Semantic Segmentation