Multimodal Visual Transformer for Sim2real Transfer in Visual Reinforcement Learning