Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction