VITA: Video Instance Segmentation via Object Token Association