Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization