Unleash the Potential of CLIP for Video Highlight Detection
Donghoon Han, Seunghyeon Seo, Eunhwan Park, Seong-Uk Nam, Nojun Kwak
–arXiv.org Artificial Intelligence
Multimodal models and large language models (LLMs) have revolutionized the use of open-world knowledge, unlocking novel potential across a wide range of tasks and applications. The video domain has benefited notably from these capabilities. In this paper, we present Highlight-CLIP (HL-CLIP), a method designed to excel at video highlight detection by leveraging the pre-trained knowledge embedded in multimodal models. By simply fine-tuning the multimodal encoder in combination with our saliency pooling technique, we achieve, to the best of our knowledge, state-of-the-art performance on the QVHighlights benchmark for highlight detection.
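The abstract does not define saliency pooling precisely, but the idea it describes can be sketched as follows: score each frame by its similarity to the text query in the (fine-tuned) multimodal embedding space, then pool per-frame saliency scores into a clip-level score so that the most salient frames dominate. The function names, the cosine-similarity scoring, and the softmax-weighted pooling below are illustrative assumptions, not the paper's actual implementation; real frame and query embeddings would come from a CLIP-style encoder rather than random vectors.

```python
import numpy as np

# Hypothetical sketch of saliency pooling for highlight detection.
# Assumption: per-frame saliency is the cosine similarity between a frame
# embedding and the text-query embedding (both from a CLIP-style encoder);
# the clip-level score is a softmax-weighted pool over frame saliencies.

def saliency_scores(frame_emb: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Per-frame saliency: cosine similarity of each frame to the query."""
    f = frame_emb / np.linalg.norm(frame_emb, axis=-1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return f @ q

def saliency_pool(scores: np.ndarray, temperature: float = 0.1) -> float:
    """Softmax-weighted pooling: salient frames dominate the clip score."""
    w = np.exp(scores / temperature)
    w /= w.sum()
    return float((w * scores).sum())

# Toy stand-ins for encoder outputs (a real pipeline would use CLIP features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))   # 8 frame embeddings
query = rng.normal(size=512)         # text-query embedding

s = saliency_scores(frames, query)
video_score = saliency_pool(s)
```

With a low temperature the pooled score approaches the maximum frame saliency; a high temperature approaches the plain mean, so the temperature controls how sharply the pooling focuses on highlight frames.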
Apr-2-2024
- Genre:
- Research Report (0.50)
- Technology:
- Information Technology > Artificial Intelligence
- Machine Learning (1.00)
- Natural Language
- Large Language Model (0.69)
- Text Processing (0.47)
- Vision (1.00)