Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

Jan-18-2025, 00:24:32 GMT – Neural Information Processing Systems

In natural language processing, most models learn semantic representations from text alone. The learned representations encode "distributional semantics" but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes "grounded semantics" for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream.
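
To make the idea of cross-modal contrastive learning concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric InfoNCE-style objective over a batch of paired image and text embeddings. The abstract does not specify the loss form, so the function name `cross_modal_contrastive_loss`, the `temperature` value, and the use of plain NumPy arrays in place of the VGG and BERT stream outputs are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    image_emb, text_emb: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same image-text pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own caption (row-wise).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick out its own image (column-wise).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


# Stand-in embeddings; in a real two-stream model these would be the
# projected outputs of the visual and language streams.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = cross_modal_contrastive_loss(emb, emb)
shuffled_loss = cross_modal_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

Under this assumed objective, matched pairs are pulled together and mismatched pairs in the batch are pushed apart, so correctly paired embeddings yield a lower loss than mispaired ones.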

cross-modal contrastive learning, explainable semantic space, grounding language, (5 more...)

Industry:
- Education > Curriculum > Subject-Specific Education (0.52)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)