Self-supervised learning of video representations from a child's perspective