TECO: Improving Multimodal Intent Recognition with Text Enhancement through Commonsense Knowledge Extraction
Nguyen, Quynh-Mai Thi, Nguyen, Lan-Nhi Thi, Nguyen, Cam-Van Thi
The objective of multimodal intent recognition (MIR) is to leverage multiple modalities, such as text, video, and audio, to detect user intentions, which is crucial for understanding human language and context in dialogue systems. Despite advances in this field, two main challenges persist: (1) effectively extracting and utilizing semantic information from robust textual features, and (2) aligning and fusing non-verbal modalities with verbal ones. This paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO) to address these challenges. We begin by extracting relations from both generated and retrieved knowledge to enrich the contextual information in the text modality. Subsequently, we align and integrate visual and acoustic representations with these enhanced text features to form a cohesive multimodal representation. Our experimental results show substantial improvements over existing baseline methods.
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.73)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Belief Revision (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
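The TECO abstract describes aligning visual and acoustic representations with enhanced text features before fusing them. A minimal sketch of one common way to do this, text-queried cross-modal attention followed by concatenation, is shown below; the function names and the specific attention-then-concatenate design are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, other, dim):
    # Text tokens (T, dim) attend over another modality's frames (S, dim),
    # producing one text-aligned vector per text token.
    scores = text @ other.T / np.sqrt(dim)   # (T, S) similarity scores
    return softmax(scores) @ other           # (T, dim) attended features

def fuse(text, video, audio):
    # Align video and audio to the text timeline, then concatenate
    # all three views into one multimodal representation per token.
    dim = text.shape[-1]
    v = cross_modal_attention(text, video, dim)
    a = cross_modal_attention(text, audio, dim)
    return np.concatenate([text, v, a], axis=-1)  # (T, 3 * dim)
```

With 4 text tokens, 6 video frames, and 5 audio frames of dimension 8, `fuse` returns a `(4, 24)` array: each text token carries its own features plus text-aligned visual and acoustic summaries.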
CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes
Qin, Yulei, Chen, Xingyu, Shen, Yunhang, Fu, Chaoyou, Gu, Yun, Li, Ke, Sun, Xing, Ji, Rongrong
Webly supervised learning has attracted increasing attention for its effectiveness in exploring publicly accessible data at scale without manual annotation. However, most existing methods of learning with web datasets face challenges from label noise and rely on limited assumptions about clean samples under various types of noise. For instance, web images retrieved with queries of tiger cat (a cat species) and drumstick (a musical instrument) are dominated largely by images of tigers and chickens, which exacerbates the challenge of fine-grained visual concept learning. In this case, exploiting both web images and their associated texts is essential to combat real-world noise. In this paper, we propose Cross-modality Aligned Prototypes (CAPro), a unified prototypical contrastive learning framework to learn visual representations with correct semantics. For one thing, we leverage textual prototypes, which stem from the distinct concept definition of classes, to select clean images by text matching and thus disambiguate the formation of visual prototypes. For another, to handle missing and mismatched noisy texts, we resort to the visual feature space to complete and enhance individual texts and thereafter improve text matching. Such semantically aligned visual prototypes are further polished up with high-quality samples, and engaged in both cluster regularization and noise removal. Besides, we propose collective bootstrapping to encourage smoother and wiser label reference from appearance-similar instances in a manner of dictionary look-up. Extensive experiments on WebVision1k and NUS-WIDE (Web) demonstrate that CAPro well handles realistic noise under both single-label and multi-label scenarios. CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition. Codes are available at https://github.com/yuleiqin/capro.
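The CAPro abstract's first step, selecting clean images by matching each sample's text against its class's textual prototype, can be sketched as a per-class top-k filter on cosine similarity. The function below is a simplified assumption of that step (names, the fixed `keep_frac` threshold, and the per-class ranking scheme are illustrative, not taken from the paper's code).

```python
import numpy as np

def l2norm(x):
    # Normalize rows to unit length so dot products become cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def select_clean(text_emb, prototypes, labels, keep_frac=0.5):
    # text_emb: (N, d) text embeddings of web samples
    # prototypes: (C, d) textual prototypes, one per class
    # labels: (N,) noisy web labels; returns a boolean mask of "clean" samples
    text_emb, prototypes = l2norm(text_emb), l2norm(prototypes)
    sims = np.sum(text_emb * prototypes[labels], axis=-1)  # cosine to own prototype
    keep = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        k = max(1, int(len(idx) * keep_frac))
        # Keep the k samples whose text best matches the class prototype.
        keep[idx[np.argsort(-sims[idx])[:k]]] = True
    return keep
```

Ranking within each class (rather than using one global similarity cutoff) keeps the selection balanced even when some classes have systematically noisier captions.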