Supplementary Material for Self-supervised Co-Training for Video Representation Learning