Zero-Shot Activity Recognition with Videos
Humans learn language through a range of perceptual cues and a continuous stream of multimodal interactions. To learn the names of the objects around us, we need some form of supervision or context: either our parents explicitly point out tangible, non-abstract objects to us in our first years, or we grasp the meaning of words from the surrounding context. Likewise, the movements of objects are described by verbs. We learn the meaning of a verb by watching objects in motion, or we pick it up from linguistic context without perceiving it visually. We then use the learned objects and verbs in unseen combinations, form novel sentences, and generalize the verbs and nouns to new instances or cases. This is an ongoing process of connecting, updating, and renewing inputs from different modalities [12]. In this work, we explore the possibility of learning verbs from multimodal cues in a way similar to humans, and we propose a neural network model that aims to jointly capture visual and textual representations. The problem is to build a cross-modal joint space that supports retrieving a textual item given a visual one, or vice versa.
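To make the retrieval idea concrete, the following is a minimal sketch of how lookup in a cross-modal joint space might work once both modalities have been embedded. It is not the model proposed in the paper: the embedding values, the verb names, the helper functions `l2_normalize`, `cosine`, and `retrieve`, and the choice of cosine similarity as the ranking score are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, candidates):
    """Return candidate names ranked by descending similarity to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )


# Hypothetical joint-space embeddings of verbs (made-up values).
verb_embeddings = {
    "run": [0.9, 0.1, 0.0],
    "jump": [0.1, 0.9, 0.1],
    "swim": [0.0, 0.2, 0.9],
}

# Hypothetical joint-space embedding of a video clip (made-up values).
video_embedding = [0.2, 0.8, 0.1]

# Visual-to-text retrieval: rank verbs by similarity to the video.
ranking = retrieve(video_embedding, verb_embeddings)
print(ranking)
```

Because the ranking depends only on distances in the shared space, a verb that never appeared with video during training could in principle still be retrieved as long as its textual embedding is available, which is the zero-shot setting the title refers to. The same `retrieve` function also covers the reverse direction if the query is a verb embedding and the candidates are video embeddings.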
Jan-22-2020
- Country:
  - Oceania > Australia
    - New South Wales > Sydney (0.04)
  - North America > United States
    - California > Santa Clara County > Palo Alto (0.04)
  - Europe
    - Germany > Bavaria
      - Upper Bavaria > Munich (0.04)
    - Belgium > Flanders
      - Flemish Brabant > Leuven (0.04)
  - Asia > China
- Genre:
  - Research Report (0.64)
- Industry:
  - Health & Medicine > Therapeutic Area (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Natural Language (1.00)
    - Machine Learning > Neural Networks (1.00)