Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos