Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
Thong Nguyen, Yi Bin, Xiaobao Wu, Xinshuai Dong, Zhiyuan Hu, Khoi Le, Cong-Duy Nguyen, See-Kiong Ng, Luu Anh Tuan
arXiv.org Artificial Intelligence
Data quality stands at the forefront of deciding the effectiveness of video-language representation learning. However, video-text pairs in previous datasets typically do not align perfectly with each other, which might lead to video-language representations that do not accurately reflect cross-modal semantics. Moreover, such data also possess an uneven distribution of concepts, thereby hampering downstream performance on less popular subjects. To address these problems, we propose a contrastive objective with a subtractive angular margin to regularize cross-modal representations in their effort to reach perfect similarity. Furthermore, to adapt to the non-uniform concept distribution, we propose a multi-layer perceptron (MLP)-parameterized weighting function that maps loss values to sample weights, enabling dynamic adjustment of the model's focus throughout training. With training guided by a small amount of unbiased meta-data and augmented by video-text data generated by a large vision-language model, we improve video-language representations and achieve superior performance on commonly used video question answering and text-video retrieval datasets.
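The two components described in the abstract can be illustrated with a minimal PyTorch sketch. This is a hypothetical reading of the objective, not the authors' released code: the subtractive margin `m` is assumed to relax the angle of each matched video-text pair (so positives need not reach perfect cosine similarity), and `WeightNet` is an assumed form of the MLP that maps per-sample loss values to sample weights.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(video_emb, text_emb, margin=0.1, tau=0.05):
    """InfoNCE-style loss with a subtractive angular margin on positives.

    Subtracting the margin from the positive pair's angle gives
    cos(theta - m) >= cos(theta), so matched pairs are treated as
    "similar enough" before reaching perfect alignment (assumed
    interpretation of the paper's objective).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sims = v @ t.T                                   # cosine similarity matrix
    theta = torch.acos(sims.clamp(-1 + 1e-7, 1 - 1e-7))
    pos_mask = torch.eye(len(v), dtype=torch.bool, device=v.device)
    # relax only the diagonal (matched video-text pairs) by the margin
    logits = torch.where(pos_mask, torch.cos(theta - margin), sims)
    labels = torch.arange(len(v), device=v.device)
    return F.cross_entropy(logits / tau, labels)

class WeightNet(torch.nn.Module):
    """Assumed MLP weighting function: per-sample loss -> sample weight."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1), torch.nn.Sigmoid())

    def forward(self, losses):
        # losses: (batch,) -> weights in (0, 1), one per sample
        return self.net(losses.unsqueeze(-1)).squeeze(-1)
```

In the meta-learning setup the abstract describes, `WeightNet` would be updated on a small unbiased meta-dataset while the main model trains on the weighted contrastive loss; that outer loop is omitted here.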
Jul-4-2024