Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Hai Yu, Chong Deng, Qinglin Zhang, Jiaqing Liu, Qian Chen, Wen Wang

arXiv.org Artificial Intelligence 

Also, coherence is essential for understanding logical structures and semantics. Enhancing coherence modeling has achieved significant improvements in long document topic segmentation (Yu et al., 2023). Therefore, we improve supervised VTS methods by thoroughly exploring multimodal fusion and multimodal coherence modeling. We enhance multimodal fusion from the perspectives of model architecture and pre-training and fine-tuning tasks. Specifically, we compare various multimodal fusion architectures built upon Cross-Attention and Mixture-of-Experts (MoE). We investigate the effect of multi-modal contrastive learning for general pre-training and fine-tuning for strengthening cross-modal alignment.

... data (Koshorek et al., 2018; Arnold et al., 2019), contemporary supervised models (Lukasik et al., 2020; Somasundaran et al., 2020; Zhang et al., 2021; Yu et al., 2023) have demonstrated superior performance compared to unsupervised approaches (Riedl and Biemann, 2012; Solbiati et al., 2021). Notably, supervised models that excel at modeling long sequences (Zhang et al., 2021; Yu et al., 2023) are capable of capturing longer contextual nuances and thereby achieve better topic segmentation performance, compared to models that model local sentence pairs or block pairs (Wang et al., 2017; Lukasik et al., 2020). In addition, recent works (Somasundaran et al., 2020; Xing ...
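To make the idea of multi-modal contrastive learning for cross-modal alignment more concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired text and visual embeddings. The text here does not give the exact loss formulation, so this generic form is an assumption rather than the authors' objective. The function names (`l2_normalize`, `log_softmax`, `symmetric_info_nce`), the temperature value, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(text_embs, visual_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumed, illustrative form).

    text_embs[i] and visual_embs[i] are treated as a positive pair;
    every other pairing in the batch acts as a negative.
    """
    assert len(text_embs) == len(visual_embs) > 0
    t = [l2_normalize(e) for e in text_embs]
    v = [l2_normalize(e) for e in visual_embs]
    n = len(t)

    # Temperature-scaled cosine similarity between every text/visual pair.
    sim = [
        [sum(a * b for a, b in zip(t[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Text-to-visual direction: each row should peak on its diagonal entry.
    t2v = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Visual-to-text direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    v2t = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (t2v + v2t)


# Toy batch: two clips with hypothetical 2-dimensional embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_visual = [[1.0, 0.0], [0.0, 1.0]]
misaligned_visual = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_info_nce(text, aligned_visual)
loss_misaligned = symmetric_info_nce(text, misaligned_visual)
```

In this toy batch the aligned pairing gives a lower loss than the swapped pairing, which is the behavior that pushes embeddings of the same clip together across modalities. In practice such a loss would be computed on encoder outputs inside an automatic differentiation framework, not on hand-written vectors.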