Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval