Li, Shoushan
Omni-SILA: Towards Omni-scene Driven Visual Sentiment Identifying, Locating and Attributing in Videos
Luo, Jiamin, Wang, Jingjing, Ma, Junxiao, Jin, Yujie, Li, Shoushan, Zhou, Guodong
Prior studies on Visual Sentiment Understanding (VSU) primarily rely on explicit scene information (e.g., facial expressions) to judge visual sentiments, while largely ignoring implicit scene information (e.g., human actions, object relations and visual backgrounds), even though such information is critical for precisely discovering visual sentiments. Motivated by this, this paper proposes a new Omni-scene driven visual Sentiment Identifying, Locating and Attributing in videos (Omni-SILA) task, aiming to interactively and precisely identify, locate and attribute visual sentiments through both explicit and implicit scene information. Furthermore, this paper argues that the Omni-SILA task faces two key challenges: modeling scene information and highlighting implicit scene information beyond the explicit. To this end, this paper proposes an Implicit-enhanced Causal MoE (ICM) approach to the Omni-SILA task. Specifically, a Scene-Balanced MoE (SBM) block and an Implicit-Enhanced Causal (IEC) block are tailored to model scene information and to highlight implicit scene information beyond the explicit, respectively. Extensive experimental results on our constructed explicit and implicit Omni-SILA datasets demonstrate the clear advantage of the proposed ICM approach over advanced Video-LLMs.
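The SBM and IEC blocks themselves are not detailed in the abstract; as a rough, hypothetical illustration of the expert-routing idea an MoE block rests on, the following PyTorch sketch implements a generic sparsely-gated mixture-of-experts layer (the class name, dimensions and top-k routing are assumptions, not the paper's actual SBM/IEC design).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    # Routes each token/frame feature to its top-k experts and mixes their outputs.
    def __init__(self, dim, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                            # x: (num_tokens, dim)
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)         # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = SimpleMoE(dim=64)
frame_features = torch.randn(10, 64)                 # e.g., per-frame scene features
print(moe(frame_features).shape)                     # torch.Size([10, 64])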
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Liu, Yang, Shen, Tongfei, Zhang, Dong, Sun, Qingying, Li, Shoushan, Zhou, Guodong
The growing importance of multi-modal humor detection within affective computing correlates with the expanding influence of short-form video sharing on social media platforms. In this paper, we propose a novel two-branch hierarchical model for short-form video humor detection (SVHD), named Comment-aided Video-Language Alignment (CVLA), via data-augmented multi-modal contrastive pre-training. Notably, our CVLA not only operates on raw signals across various modal channels but also yields an appropriate multi-modal representation by aligning the video and language components within a consistent semantic space. The experimental results on two humor detection datasets, DY11k and UR-FUNNY, demonstrate that CVLA dramatically outperforms state-of-the-art and several competitive baseline approaches. Our dataset, code and model are released at https://github.com/yliu-cs/CVLA.
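As a rough illustration of the contrastive video-language alignment idea (not CVLA's actual implementation), the following PyTorch sketch computes a symmetric InfoNCE loss over paired video and language embeddings projected into a shared semantic space; the function name, feature dimensions and temperature are assumptions.

import torch
import torch.nn.functional as F

def video_language_infonce(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim) projections of the two branches into a shared space.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature               # pairwise cosine similarities
    targets = torch.arange(v.size(0))            # matched video-language pairs lie on the diagonal
    # Symmetric InfoNCE: pull matched pairs together in both directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with random features standing in for the video and comment/title encoders.
loss = video_language_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())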
Pre-trained Token-replaced Detection Model as Few-shot Learner
Li, Zicheng, Li, Shoushan, Zhou, Guodong
Pre-trained masked language models have demonstrated remarkable ability as few-shot learners. In this paper, as an alternative, we propose a novel approach to few-shot learning with pre-trained token-replaced detection models like ELECTRA. In this approach, we reformulate a classification or a regression task as a token-replaced detection problem. Specifically, we first define a template and label description words for each task and put them into the input to form a natural language prompt. Then, we employ the pre-trained token-replaced detection model to predict which label description word is the most original (i.e., least replaced) among all label description words in the prompt. A systematic evaluation on 16 datasets demonstrates that our approach outperforms few-shot learners with pre-trained masked language models in both one-sentence and two-sentence learning tasks.
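A minimal sketch of this prompting idea, assuming the HuggingFace transformers ELECTRA discriminator; the template "It was ..." and the label words "great"/"terrible" are illustrative choices, not the paper's exact prompts.

import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("google/electra-base-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-base-discriminator")

def classify(sentence, label_words=("great", "terrible")):
    scores = []
    for word in label_words:
        # Put each label description word into the same prompt template.
        inputs = tokenizer(f"{sentence} It was {word}.", return_tensors="pt")
        with torch.no_grad():
            # Higher logits mean the discriminator believes a token was replaced.
            logits = model(**inputs).logits[0]
        # Find the label word's subword positions and average its "replaced" score.
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = inputs["input_ids"][0].tolist()
        start = next(i for i in range(len(ids)) if ids[i:i + len(word_ids)] == word_ids)
        scores.append(logits[start:start + len(word_ids)].mean().item())
    # The label whose description word looks most original (least replaced) wins.
    return label_words[min(range(len(scores)), key=scores.__getitem__)]

print(classify("The movie was a waste of two hours."))  # expected: "terrible"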
Human-Like Decision Making: Document-level Aspect Sentiment Classification via Hierarchical Reinforcement Learning
Wang, Jingjing, Sun, Changlong, Li, Shoushan, Wang, Jiancheng, Si, Luo, Zhang, Min, Liu, Xiaozhong, Zhou, Guodong
Recently, neural networks have shown promising results on Document-level Aspect Sentiment Classification (DASC). However, these approaches often offer little transparency w.r.t. their inner working mechanisms and lack interpretability. In this paper, to simulate the steps a human takes when analyzing aspect sentiment in a document, we propose a new Hierarchical Reinforcement Learning (HRL) approach to DASC. This approach incorporates clause selection and word selection strategies to tackle the data noise problem in DASC. First, a high-level policy is proposed to select aspect-relevant clauses and discard noisy clauses. Then, a low-level policy is proposed to select sentiment-relevant words and discard noisy words inside the selected clauses. Finally, a sentiment rating predictor is designed to provide reward signals that guide both clause and word selection. Experimental results demonstrate the impressive effectiveness of the proposed approach to DASC over state-of-the-art baselines.
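As a rough illustration of the two-level selection idea (not the paper's architecture), the following PyTorch sketch uses a Bernoulli keep/drop policy at the clause level and at the word level, with a placeholder rating predictor supplying a REINFORCE-style reward; all module names and features are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectorPolicy(nn.Module):
    # Bernoulli keep/drop policy over a sequence of feature vectors.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                        # feats: (n, dim)
        probs = torch.sigmoid(self.score(feats)).squeeze(-1)
        dist = torch.distributions.Bernoulli(probs)
        actions = dist.sample()                      # 1 = keep, 0 = drop
        return actions, dist.log_prob(actions).sum()

dim = 64
clause_policy, word_policy = SelectorPolicy(dim), SelectorPolicy(dim)
rating_predictor = nn.Linear(dim, 5)                 # stand-in 5-way sentiment rating head

clauses = [torch.randn(7, dim), torch.randn(5, dim)]  # word features, one tensor per clause
clause_feats = torch.stack([c.mean(0) for c in clauses])

# High-level step: keep aspect-relevant clauses, drop noisy ones.
keep_clause, logp_high = clause_policy(clause_feats)
# Low-level step: keep sentiment-relevant words inside the kept clauses.
kept_words, logp_low = [], torch.tensor(0.0)
for i, clause in enumerate(clauses):
    if keep_clause[i] > 0:
        keep_word, lp = word_policy(clause)
        kept_words.append(clause[keep_word.bool()])
        logp_low = logp_low + lp

selected = torch.cat(kept_words) if kept_words else clause_feats
doc_repr = selected.mean(0) if selected.size(0) > 0 else clause_feats.mean(0)

# The rating predictor's (negative) loss on the gold rating acts as the reward.
gold_rating = torch.tensor([3])
reward = -F.cross_entropy(rating_predictor(doc_repr).unsqueeze(0), gold_rating)
loss = -(logp_high + logp_low) * reward.detach()     # REINFORCE update for both policies
loss.backward()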
Opinion Target Extraction Using a Shallow Semantic Parsing Framework
Li, Shoushan (Soochow University), Wang, Rongyang (Soochow University), Zhou, Guodong (Soochow University)
In this paper, we present a simplified shallow semantic parsing approach to extracting opinion targets. This is done by formulating opinion target extraction (OTE) as a shallow semantic parsing problem, with the opinion expression as the predicate and the corresponding targets as its arguments. In principle, our parsing approach to OTE differs from the state-of-the-art sequence labeling one in two aspects. First, we model OTE at the parse tree level, where abundant structured syntactic information is available, instead of at the word sequence level, where only lexical information is available. Second, we focus on determining whether a constituent, rather than a word, is an opinion target, via a simplified shallow semantic parsing framework. Evaluation on two datasets shows that structured syntactic information plays a critical role in capturing the domination relationship between an opinion expression and its targets. It also shows that our parsing approach significantly outperforms the state-of-the-art sequence labeling one.
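As a rough illustration of casting OTE as constituent-level classification over a parse tree, the sketch below enumerates NP constituents with NLTK and passes each to a placeholder classifier; the example parse and the classifier are illustrative assumptions, not the paper's actual framework.

from nltk import Tree

# An illustrative constituency parse; a real system would obtain this from a parser.
parse = Tree.fromstring(
    "(S (NP (DT The) (NN battery) (NN life)) (VP (VBZ is) (ADJP (JJ amazing))))"
)
opinion_expression = "amazing"

def candidate_constituents(tree):
    # Enumerate NP constituents as candidate opinion targets for the predicate.
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        yield " ".join(subtree.leaves())

def is_target(constituent, opinion):
    # Placeholder for the constituent classifier; a real model would score
    # structured features such as the tree path between the constituent and
    # the opinion expression.
    return True

targets = [c for c in candidate_constituents(parse) if is_target(c, opinion_expression)]
print(targets)  # ['The battery life']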