Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering