Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

Anam Fatima, Yi Yu, Janak Kapuriya, Julien Lalanne, Jainendra Shukla

Nov-3-2025–arXiv.org Artificial Intelligence

Abstract--Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect: prioritizing the video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP's visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to the ongoing viewer conversation. It employs an efficient weighted sum of frames to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder, which uses a cross-attention mechanism to attend to each modality, ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large-scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.

Live commenting on videos has become a popular feature on live streaming platforms such as Twitch, YouTube, Bilibili, Facebook and Instagram. Also known as "bullet screen" or "danmaku", it offers a dynamic and interactive experience, promoting engagement and conversations among viewers [1]-[3]. In contrast to traditional video comments, which neither reference specific moments in the video nor interact with one another, danmaku comments enable rich multimodal information interactions [4].

Yu is with the Graduate School of Advanced Science and Engineering at Hiroshima University.
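The abstract describes weighting video frames by their semantic relevance to the ongoing chat and combining them with a weighted sum. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes frame and chat embeddings already live in a shared space (as CLIP features would), uses random vectors as stand-ins for those features, and assumes cosine similarity followed by a temperature-scaled softmax as the weighting rule. The function name `aggregate_frames` and the `temperature` parameter are hypothetical.

```python
import numpy as np


def aggregate_frames(frame_emb, chat_emb, temperature=0.1):
    """Return a relevance-weighted sum of frame embeddings and the weights.

    frame_emb: array of shape (num_frames, dim), one embedding per frame.
    chat_emb:  array of shape (dim,), embedding of the recent chat context.
    The softmax weighting and temperature are assumptions for illustration.
    """
    # L2-normalize so that the dot product below is cosine similarity.
    frames = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    chat = chat_emb / np.linalg.norm(chat_emb)

    sims = frames @ chat                  # (num_frames,) cosine similarities
    logits = sims / temperature
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()

    # Weighted sum over frames: relevant frames dominate the video summary.
    video_repr = weights @ frame_emb      # (dim,)
    return video_repr, weights


# Stand-in data: 8 random "frame" vectors; the chat vector is a noisy copy
# of frame 3, so frame 3 should receive the largest weight.
rng = np.random.default_rng(0)
frame_emb = rng.normal(size=(8, 512))
chat_emb = frame_emb[3] + 0.1 * rng.normal(size=512)

video_repr, weights = aggregate_frames(frame_emb, chat_emb)
print(weights.round(3))
```

A lower `temperature` concentrates the weight on the few most chat-relevant frames, while a higher value approaches a plain average over all frames.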



