AITopics | vilbert

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.
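The abstract above only names co-attentional transformer layers; it does not spell out how the two streams exchange information. A minimal sketch follows, assuming co-attention means that each stream's queries attend over the other stream's keys and values. The function names (`attend`, `co_attention`) and the toy feature vectors are hypothetical, and learned projections, multiple heads, residual connections and feed-forward blocks are all omitted, so this is an illustration of the idea rather than the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: each query returns a weighted mix of `values`."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def co_attention(visual, textual):
    """Assumed co-attention: each stream queries the OTHER stream's features.

    No learned projections are applied here; the raw features serve as
    queries, keys and values.
    """
    new_visual = attend(visual, textual, textual)   # image regions look at tokens
    new_textual = attend(textual, visual, visual)   # tokens look at image regions
    return new_visual, new_textual


# Toy inputs: 3 image-region features and 2 token features, each 2-dimensional.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
textual = [[1.0, 0.0], [0.0, 2.0]]
new_visual, new_textual = co_attention(visual, textual)
```

Each stream keeps its own sequence length (three regions, two tokens), but every output vector is now a convex combination of features from the other modality, which is the sense in which the streams "interact" in this sketch.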

name change, pretraining task-agnostic visiolinguistic representation, vilbert, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)
Information Technology > Artificial Intelligence > Natural Language (0.41)

TriBERT: Human-centric Audio-visual Representation Learning

Neural Information Processing Systems | Dec-24-2025, 03:01:50 GMT

The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored their use in audio-visual modalities, and none, to our knowledge, illustrate them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT -- a transformer-based architecture, inspired by ViLBERT, which enables contextual feature learning across three modalities: vision, pose, and audio, with the use of flexible co-attention. The use of pose keypoints is inspired by recent works that illustrate that such representations can significantly boost performance in many audio-visual scenarios where often one or more persons are responsible for the sound explicitly (e.g., talking) or implicitly (e.g., sound produced as a function of a human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak supervision to allow granular cross-modal interactions for visual and pose modalities. Further, we supplement learning with a sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy.

human-centric audio-visual representation learning, name change, tribert, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.98)

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

Neural Information Processing Systems | Aug-20-2025, 01:50:05 GMT

Neural Information Processing Systems http://nips.cc/

representation, vilbert, vision-and-language task, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Oregon (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.96)
(2 more...)

clear / well-organized [R1]; our approach "very interesting" [R3] and novel [R2, R3]; our results significant and

Neural Information Processing Systems | Aug-20-2025, 01:49:52 GMT

We thank the reviewers for the thoughtful feedback! We respond to select comments below but will address all feedback. We investigate the RefCOCO+ task. We will perform more task-specific task in supplementary. VCR extends to answer justifications like "[Person3] is delivering [...]" These ablations are valuable and will be added to the paper.

ablation, pretrain, vision and language task, (3 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.41)

Technology:

Information Technology > Artificial Intelligence > Vision (0.32)
Information Technology > Artificial Intelligence > Natural Language (0.31)

Reviews: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems | Jan-27-2025, 00:29:44 GMT

I think that this paper is a solid extension of masked language model pre-training to image-and-text (e.g., captioning) tasks. It defines two novel but intuitive pre-training tasks for this scenario: (i) predicting the semantic class of masked image regions given the surrounding image regions (from the same image) and the corresponding text, (ii) predicting whether image and text pairs are aligned. They demonstrate significant improvements over both the previous SOTA and the strong baseline of simply using a pre-trained text-only BERT model. They also show that having two encoders (with different parameters), one for images and one for text, is superior to a joint encoder. I would have liked to have seen more ablation of the pre-training tasks, since I think that this is more interesting than the model depth ablation that the authors performed.

ablation, pretraining task-agnostic visiolinguistic representation, vision-and-language task, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language (0.61)

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems | Oct-10-2024, 22:24:15 GMT

pretraining task-agnostic visiolinguistic representation, vilbert, vision-and-language task, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.65)
Information Technology > Artificial Intelligence > Natural Language (0.45)

TriBERT: Human-centric Audio-visual Representation Learning

Neural Information Processing Systems | Oct-10-2024, 08:51:05 GMT

architecture, human-centric audio-visual representation learning, tribert, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Agrawal, Aishwarya, Kajić, Ivana, Bugliarello, Emanuele, Davoodi, Elnaz, Gergely, Anita, Blunsom, Phil, Nematzadeh, Aida

arXiv.org Artificial Intelligence | Apr-1-2023

Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
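The abstract above does not say which automatic metrics it revisits. As an illustration of how a string-matching metric can penalize a reasonable answer, here is a minimal sketch assuming the widely used VQA accuracy, in which a prediction scores min(matches / 3, 1) against ten human answers. The function name `vqa_accuracy` and the example answers are hypothetical, and the answer normalization and subset averaging done by the official evaluation script are omitted.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit once 3 annotators gave the same string.

    Assumption: answers are compared by exact string equality, with no
    normalization of case, punctuation or articles.
    """
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for one question: ten human answers.
human_answers = ["dog"] * 8 + ["puppy"] * 2

score_majority = vqa_accuracy("dog", human_answers)     # 8 matches, capped at 1.0
score_minority = vqa_accuracy("puppy", human_answers)   # 2 matches, 2/3 credit
score_synonym = vqa_accuracy("canine", human_answers)   # 0 matches, no credit
```

Under this sketch "canine" receives zero credit even though it is arguably correct, because no annotator typed that exact string. This is one concrete way a stringent automatic metric can penalize a correct response.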

accuracy, machine learning, question answering, (21 more...)

arXiv.org Artificial Intelligence

arXiv:2205.12191

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)
