Reviews: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems 

I think that this paper is a solid extension of masked language model pre-training to joint image-and-text tasks (e.g., captioning). It defines two novel but intuitive pre-training tasks for this setting: (i) predicting the semantic class of masked image regions given the surrounding regions from the same image and the corresponding text, and (ii) predicting whether an image and a text passage are aligned. The authors demonstrate significant improvements over both the previous SOTA and the strong baseline of simply using a pre-trained text-only BERT model. They also show that using two encoders with separate parameters, one for images and one for text, is superior to a single joint encoder. I would have liked to see more ablation of the pre-training tasks, since I find that more interesting than the model-depth ablation the authors did perform.
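
To make the two pre-training objectives concrete, here is a minimal PyTorch sketch of how such losses might be computed on top of a model that produces per-region and pooled joint features. This is not the authors' implementation; the module names, shapes, and the choice of KL divergence against the object detector's class distribution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PretrainingHeads(nn.Module):
    """Sketch of heads for (i) masked-region classification and (ii) alignment prediction."""

    def __init__(self, hidden_dim=768, num_detector_classes=1601):
        super().__init__()
        # (i) masked-region head: predicts a distribution over detector classes
        self.region_classifier = nn.Linear(hidden_dim, num_detector_classes)
        # (ii) alignment head: predicts whether the image/text pair matches
        self.alignment_classifier = nn.Linear(hidden_dim, 2)

    def forward(self, region_feats, pooled_feats,
                detector_class_probs, masked_region_mask, alignment_labels):
        # region_feats:         (B, R, H) contextualized image-region features
        # pooled_feats:         (B, H)    pooled joint image-text representation
        # detector_class_probs: (B, R, C) soft labels from the object detector
        # masked_region_mask:   (B, R)    1 where the region was masked, else 0
        # alignment_labels:     (B,)      1 if the pair is aligned, else 0

        # (i) KL divergence between the predicted class distribution and the
        #     detector's distribution, averaged over masked regions only.
        log_pred = F.log_softmax(self.region_classifier(region_feats), dim=-1)
        kl = F.kl_div(log_pred, detector_class_probs, reduction="none").sum(-1)
        mask = masked_region_mask.float()
        region_loss = (kl * mask).sum() / mask.sum().clamp(min=1.0)

        # (ii) binary classification of aligned vs. mismatched image-text pairs.
        alignment_logits = self.alignment_classifier(pooled_feats)
        alignment_loss = F.cross_entropy(alignment_logits, alignment_labels)

        return region_loss + alignment_loss
```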