AITopics | frozenbilm

00d1f03b87a401b1c7957e0cc785d0bc-Paper-Conference.pdf

Neural Information Processing SystemsApr-24-2026, 07:36:03 GMT

T annotation of o tackle questions this of problem, and visual answers question-answer recent for methods videos, . In pretrained here particular build on on, a fr W promising ozen eb-scale bidir approach te ectional xt-only language adapts data to fr multi-modal ozen models autor (BiLM) egr inputs.

large language model, machine learning, natural language, (18 more...)

Neural Information Processing Systems

Industry: Education (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Neural Information Processing SystemsDec-23-2025, 16:38:51 GMT

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.

electronic proceedings, frozen bidirectional language model, name change, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)

Add feedback

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Neural Information Processing SystemsOct-9-2024, 08:37:34 GMT

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question.

frozen bidirectional language model, multi-modal data, zero-shot videoqa, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Neural Information Processing SystemsFeb-18-2024, 04:15:49 GMT

Video question answering (VideoQA) is a complex task that requires diverse multimodal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models are publicly available at [1].

arxiv preprint arxiv, language model, videoqa, (14 more...)

Neural Information Processing Systems

Country:

Europe > Czechia > Prague (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Revealing the Illusion of Joint Multimodal Understanding in VideoQA Models

Rawal, Ishaan Singh, Jaiswal, Shantanu, Fernando, Basura, Tan, Cheston

arXiv.org Artificial IntelligenceSep-30-2023

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models jointly capture and leverage the rich multimodal structures and dynamics from video and text? Or are they merely exploiting shortcuts to achieve high scores? Hence, we design QUAG (QUadrant AveraGe), a lightweight and non-parametric probe, to critically analyze multimodal representations. QUAG facilitates combined dataset-model study by systematic ablation of model's coupled multimodal understanding during inference. Surprisingly, it demonstrates that the models manage to maintain high performance even under multimodal impairment. We extend QUAG to design "QUAG-attention", a simplistic and lessexpressive replacement of self-attention. We find that the models with QUAGattention achieve similar performance with significantly less mulops without any finetuning. These findings indicate that the current VideoQA benchmarks and metrics do not penalize models that find shortcuts and discount joint multimodal understanding. Motivated by this, we propose the CLAVI (Counterfactual in LAnguage and VIdeo), a diagnostic dataset for coupled multimodal understanding in VideoQA. CLAVI consists of temporal questions and videos that are augmented to curate balanced counterfactuals in language and video domains. We evaluate models on CLAVI and find that all models achieve high performance on multimodal shortcut instances, but most of them have very poor performance on the counterfactual instances that necessitate joint multimodal understanding. Overall, with the multimodal representation analysis using QUAG and diagnostic analysis using CLAVI, we show that many VideoQA models are incapable of learning multimodal representations and that their success on standard datasets is an illusion of joint multimodal understanding. Multimodal learning with videos and language is challenging, despite the shared sequential nature of these modalities, due to their distinct underlying structures. That is, videos exhibit spatio-temporal dynamics in the pixel space, whereas language representation is composed of the syntax and semantics of word sequences. Hence, tasks like Video Question Answering (VideoQA) (Zhong et al., 2022) are difficult as they necessitate the model to acquire accurate representations of both the modalities and establish meaningful connections between them. Transformers have demonstrated exceptional performance on VideoQA benchmarks (Zhong et al., 2022).

proceedings, representation, video, (16 more...)

arXiv.org Artificial Intelligence

2306.08889

Country:

Asia > Singapore (0.04)
Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(7 more...)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
(2 more...)

Add feedback

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Yang, Antoine, Miech, Antoine, Sivic, Josef, Laptev, Ivan, Schmid, Cordelia

arXiv.org Artificial IntelligenceOct-10-2022

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of question and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised setting. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.

frozenbilm, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2206.08155

Country: