AITopics | vision-and-language task

Collaborating Authors

vision-and-language task

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

Neural Information Processing SystemsFeb-14-2026, 02:06:29 GMT

Neural Information Processing Systems http://nips.cc/

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

North America > United States > Oregon (0.04)
North America > Canada (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.96)
(2 more...)

Add feedback

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing SystemsDec-25-2025, 23:17:19 GMT

We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

name change, pretraining task-agnostic visiolinguistic representation, vilbert, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.61)
Information Technology > Artificial Intelligence > Natural Language (0.41)

Add feedback

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Neural Information Processing SystemsDec-25-2025, 03:51:11 GMT

Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating catastrophic forgetting, but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer. We envision that CLiMB will facilitate research on a new class of CL algorithms for this challenging multimodal setting.

continual learning benchmark, name change, vision-and-language task, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.41)

Add feedback

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

Neural Information Processing SystemsAug-20-2025, 01:50:05 GMT

Neural Information Processing Systems http://nips.cc/

representation, vilbert, vision-and-language task, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Oregon (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.96)
(2 more...)

Add feedback

Reviews: ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing SystemsJan-27-2025, 00:29:44 GMT

I think that this paper is a solid extension of masked language model pre-training to image-and-text (e.g., captioning) tasks. It defines two novel but intuitive pre-training tasks for this scenario: (i) predicting the semantic class of masked image regions given the surrounding image regions (from the same image) and the corresponding text, (ii) predicting whether image and text pairs are aligned. They demonstrate significant improvements over both the previous SOTA and the strong baseline of simply using a pre-trained text-only BERT model. They also show that having two encoders (with different parameters), one for images and one for text, is superior to a joint encoder. I would have liked to have seen more ablation of the pre-training tasks, since I think that this is more interesting than the model depth ablation that the authors performed.

ablation, pretraining task-agnostic visiolinguistic representation, vision-and-language task, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.88)
Information Technology > Artificial Intelligence > Natural Language (0.61)

Add feedback

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks

Neural Information Processing SystemsJan-18-2025, 18:21:59 GMT

Current state-of-the-art vision-and-language models are evaluated on tasks either individually or in a multi-task setting, overlooking the challenges of continually learning (CL) tasks as they arrive. Existing CL benchmarks have facilitated research on task adaptation and mitigating "catastrophic forgetting", but are limited to vision-only and language-only tasks. We present CLiMB, a benchmark to study the challenge of learning multimodal tasks in a CL setting, and to systematically evaluate how upstream continual learning can rapidly generalize to new multimodal and unimodal tasks. CLiMB includes implementations of several CL algorithms and a modified Vision-Language Transformer (ViLT) model that can be deployed on both multimodal and unimodal tasks. We find that common CL methods can help mitigate forgetting during multimodal task learning, but do not enable cross-task knowledge transfer.

continual learning benchmark, multimodal and unimodal task, vision-and-language task, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.44)

Add feedback

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing SystemsOct-10-2024, 22:24:15 GMT

pretraining task-agnostic visiolinguistic representation, vilbert, vision-and-language task, (1 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.65)
Information Technology > Artificial Intelligence > Natural Language (0.45)

Add feedback

Newvision: application for helping blind people using deep learning

Bobba, Kumar Srinivas, K, Kartheeban, Boddu, Vamsi Krishna Sai, Bolla, Vijaya Mani Surendra, Bugga, Dinesh

arXiv.org Artificial IntelligenceNov-5-2023

As able-bodied people, we often take our vision for granted. For people who are visually impaired, however, their disability can have a significant impact on their daily lives. We are developing proprietary headgear that will help visually impaired people navigate their surroundings, identify objects and people, read text, and avoid obstacles. The headgear will use a combination of computer vision, distance estimation with ultrasonic sensors, voice recognition, and voice assistants to provide users with real-time information about their environment. Users will be able to interact with the headgear through voice commands, such as ''What is that?'' to identify an object or ''Navigate to the front door'' to find their way around. The headgear will then provide the user with a verbal description of the object or spoken navigation instructions. We believe that this headgear has the potential to make a significant difference in the lives of visually impaired people, allowing them to live more independently and participate more fully in society.

benchmark, dataset, reasoning, (16 more...)

arXiv.org Artificial Intelligence

2311.03395

Country: Asia > India (0.05)

Genre: Research Report (0.40)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Sensing and Signal Processing > Image Processing (0.94)
(2 more...)

Add feedback

Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

Gan, Zhe, Li, Linjie, Li, Chunyuan, Wang, Lijuan, Liu, Zicheng, Gao, Jianfeng

arXiv.org Artificial IntelligenceOct-17-2022

This paper surveys vision-language pre-training (VLP) methods for multimodal intelligence that have been developed in the last few years. We group these approaches into three categories: ($i$) VLP for image-text tasks, such as image captioning, image-text retrieval, visual question answering, and visual grounding; ($ii$) VLP for core computer vision tasks, such as (open-set) image classification, object detection, and segmentation; and ($iii$) VLP for video-text tasks, such as video captioning, video-text retrieval, and video question answering. For each category, we present a comprehensive review of state-of-the-art methods, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies. In addition, for each category, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.

large language model, machine learning, question answering, (25 more...)

arXiv.org Artificial Intelligence

2210.09263

Country:

Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre:

Research Report (1.00)
Overview (1.00)
Instructional Material > Course Syllabus & Notes (0.88)

Industry:

Leisure & Entertainment (1.00)
Education (1.00)
Media (0.92)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Vision > Image Understanding (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks

Chen, Tianwei, Garcia, Noa, Otani, Mayu, Chu, Chenhui, Nakashima, Yuta, Nagahara, Hajime

arXiv.org Artificial IntelligenceAug-23-2022

Is more data always better to train vision-and-language models? We study knowledge transferability in multi-modal tasks. The current tendency in machine learning is to assume that by joining multiple datasets from different tasks their overall performance will improve. However, we show that not all the knowledge transfers well or has a positive impact on related tasks, even when they share a common goal. We conduct an exhaustive analysis based on hundreds of cross-experiments on 12 vision-and-language tasks categorized in 4 groups. Whereas tasks in the same group are prone to improve each other, results show that this is not always the case. Other factors such as dataset size or pre-training stage have also a great impact on how well the knowledge is transferred.

gqa, knowledge, transferability, (17 more...)

arXiv.org Artificial Intelligence

2208.10758

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
Asia > Japan > Honshū > Kansai > Kyoto Prefecture > Kyoto (0.04)
Europe > Poland (0.04)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Add feedback