Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Yu, Hai, Deng, Chong, Zhang, Qinglin, Liu, Jiaqing, Chen, Qian, Wang, Wen

Aug-1-2024–arXiv.org Artificial Intelligence

Also, coherence is essential for understanding logical structures and semantics. Enhancing coherence modeling has achieved significant improvements in long document topic segmentation (Yu et al., 2023). Therefore, we improve supervised VTS methods by thoroughly exploring multimodal fusion and multimodal coherence modeling. We enhance multimodal fusion from the perspectives of model architecture and pre-training and fine-tuning tasks. Specifically, we compare various multimodal fusion architectures built upon Cross-Attention and Mixture-of-Experts (MoE). We investigate the effect of multi-modal contrastive learning for general pre-training and fine-tuning for strengthening cross-modal alignment.

... data (Koshorek et al., 2018; Arnold et al., 2019), contemporary supervised models (Lukasik et al., 2020; Somasundaran et al., 2020; Zhang et al., 2021; Yu et al., 2023) have demonstrated superior performance compared to unsupervised approaches (Riedl and Biemann, 2012; Solbiati et al., 2021). Notably, supervised models that excel at modeling long sequences (Zhang et al., 2021; Yu et al., 2023) are capable of capturing longer contextual nuances and thereby achieve better topic segmentation performance, compared to models that model local sentence pairs or block pairs (Wang et al., 2017; Lukasik et al., 2020). In addition, recent works (Somasundaran et al., 2020; Xing ...
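The excerpt mentions multi-modal contrastive learning for strengthening cross-modal alignment but does not give a loss function. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE objective over paired text and visual embeddings; this is a common formulation and is not confirmed to be the one used in the paper. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (log-sum-exp form)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def symmetric_info_nce(text_emb, visual_emb, temperature=0.07):
    """Hypothetical helper: generic symmetric InfoNCE loss.

    Row i of text_emb and row i of visual_emb are assumed to describe the
    same clip (a positive pair); every other pairing in the batch is
    treated as a negative. The temperature is an assumed default.
    """
    if not text_emb or len(text_emb) != len(visual_emb):
        raise ValueError("expected two non-empty batches of equal size")
    text = [l2_normalize(v) for v in text_emb]
    visual = [l2_normalize(v) for v in visual_emb]
    n = len(text)
    # Temperature-scaled cosine similarity between every text/visual pair.
    sim = [
        [sum(a * b for a, b in zip(text[i], visual[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text-to-visual direction: each row should peak on its own index.
    t2v = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Visual-to-text direction: same check over the columns.
    v2t = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (t2v + v2t)


# Toy batch: aligned pairs should score a lower loss than shuffled pairs.
aligned_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Under these assumptions, minimizing the loss pulls the text and visual embeddings of the same clip together and pushes mismatched pairs apart, which is the general sense in which contrastive objectives encourage cross-modal alignment.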


