Energy-Based Models for Cross-Modal Localization using Convolutional Transformers