Patch-level Representation Learning for Self-supervised Vision Transformers