AITopics | cvlm

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

Neural Information Processing Systems Feb-10-2026, 17:59:17 GMT

Different from the original MoCo [19] and its cross-modal versions [15, 33, 35] that utilize momentum update for only momentum encoders to maintain a large consistent queue, our BMU strategy imposes momentum update on both momentum encoders and (video/text) encoders.

artificial intelligence, encoder, machine learning, (18 more...)

Neural Information Processing Systems

Country:

Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Beijing > Beijing (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

Neural Information Processing Systems Dec-24-2025, 19:06:36 GMT

Video-language models suffer from forgetting old/learned knowledge when trained with streaming data. In this work, we thus propose a continual video-language modeling (CVLM) setting, where models are sequentially trained on five widely-used video-text datasets with different data distributions. Although most existing continual learning methods have achieved great success by exploiting extra information (e.g., memory data of past tasks) or dynamically extended networks, they cause enormous resource consumption when transferred to our CVLM setting. To overcome the challenges (i.e., catastrophic forgetting and heavy resource consumption) in CVLM, we propose a novel cross-modal MoCo-based model with bidirectional momentum update (BMU), termed BMU-MoCo. Concretely, our BMU-MoCo has two core designs: (1) Different from the conventional MoCo, we apply the momentum update not only to the momentum encoders but also to the encoders (i.e., bidirectionally) at each training step, which enables the model to review the learned knowledge retained in the momentum encoders.
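
The bidirectional update described in the abstract above can be illustrated with a minimal sketch. This is not the authors' Algorithm 1: the function names, the two coefficients `m` and `m_back`, the order of the three steps, and the use of plain Python lists as stand-ins for encoder weights are all assumptions made for illustration. Only the first exponential-moving-average direction (momentum encoder follows encoder) is the standard MoCo rule; the reverse direction is one plausible reading of "momentum update on both momentum encoders and encoders".

```python
from typing import List, Tuple

Params = List[float]  # hypothetical stand-in for a flat vector of encoder weights


def ema(target: Params, source: Params, m: float) -> Params:
    """Return m * target + (1 - m) * source, element-wise."""
    return [m * t + (1.0 - m) * s for t, s in zip(target, source)]


def bidirectional_momentum_step(
    encoder: Params,
    momentum_encoder: Params,
    grads: Params,
    lr: float = 0.1,
    m: float = 0.99,      # assumed coefficient for the momentum encoder
    m_back: float = 0.9,  # assumed coefficient for the encoder
) -> Tuple[Params, Params]:
    # 1) Ordinary gradient step on the (video or text) encoder.
    encoder = [w - lr * g for w, g in zip(encoder, grads)]
    # 2) Conventional MoCo direction: the momentum encoder slowly follows the encoder.
    momentum_encoder = ema(momentum_encoder, encoder, m)
    # 3) Reverse direction (assumed form): the encoder is pulled back toward the
    #    momentum encoder, which retains knowledge from earlier training.
    encoder = ema(encoder, momentum_encoder, m_back)
    return encoder, momentum_encoder


if __name__ == "__main__":
    enc, mom = bidirectional_momentum_step([1.0], [0.0], [0.0], m=0.5, m_back=0.5)
    print(enc, mom)
```

In this toy setting, a smaller `m_back` pulls the encoder more strongly toward the momentum encoder, trading plasticity on the current dataset for retention of earlier ones; the values actually used in the paper are not given in this excerpt.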

bidirectional momentum update, bmu-moco, continual video-language modeling, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.62)
Information Technology > Artificial Intelligence > Machine Learning (0.39)

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling - Supplementary Material - Yizhao Gao

Neural Information Processing Systems Aug-16-2025, 23:20:57 GMT

We provide the pseudocode of our BMU-MoCo in Algorithm 1. The R@5 results and their corresponding FR/HM are reported. The memory data are simply used as training samples in the training process. The model architecture is exactly the same as Base-MoCo.

artificial intelligence, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.05)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.40)

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling - Yizhao Gao

Neural Information Processing Systems Aug-16-2025, 23:20:54 GMT

Video-language models suffer from forgetting old/learned knowledge when trained with streaming data.

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Asia > China > Beijing > Beijing (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.42)

BMU-MoCo: Bidirectional Momentum Update for Continual Video-Language Modeling

Neural Information Processing Systems Jan-17-2025, 18:41:28 GMT

Video-language models suffer from forgetting old/learned knowledge when trained with streaming data. In this work, we thus propose a continual video-language modeling (CVLM) setting, where models are sequentially trained on five widely-used video-text datasets with different data distributions. Although most existing continual learning methods have achieved great success by exploiting extra information (e.g., memory data of past tasks) or dynamically extended networks, they cause enormous resource consumption when transferred to our CVLM setting. To overcome the challenges (i.e., catastrophic forgetting and heavy resource consumption) in CVLM, we propose a novel cross-modal MoCo-based model with bidirectional momentum update (BMU), termed BMU-MoCo. Concretely, our BMU-MoCo has two core designs: (1) Different from the conventional MoCo, we apply the momentum update not only to the momentum encoders but also to the encoders (i.e., bidirectionally) at each training step, which enables the model to review the learned knowledge retained in the momentum encoders.

bidirectional momentum update, bmu-moco, continual video-language modeling, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.64)

Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment

Li, Yunxin, Chen, Xinyu, Hu, Baotian, Shi, Haoyuan, Zhang, Min

arXiv.org Artificial Intelligence Jun-26-2024

Evaluating and rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely-used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially for challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. The FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks, and the experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain of 5.0%). Ablation studies also verify the effectiveness of VKA and FKA, respectively. The code is available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper

cvlm, knowledge, visual knowledge, (15 more...)

arXiv.org Artificial Intelligence

2402.13561

Country: