A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
Yu, Haonan, Siddharth, N., Barbu, Andrei, Siskind, Jeffrey Mark
Journal of Artificial Intelligence Research
We present an approach to simultaneously reasoning about a video clip and an entire natural-language sentence. The compositional nature of language is exploited to construct models that represent the meanings of entire sentences, composed out of the meanings of the words in those sentences and mediated by a grammar that encodes the predicate-argument relations. We demonstrate that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants (nouns), their characteristics (adjectives), the actions performed (verbs), the manner of such actions (adverbs), and changing spatial relations between participants (prepositions) affect the meaning of a sentence and how it is grounded in video. We exploit this methodology in three ways. In the first, a video clip and a sentence are taken as input and the participants in the event described by the sentence are highlighted, even when the clip depicts multiple similar simultaneous events. In the second, a video clip is taken as input without a sentence and a sentence is generated that describes an event in that clip. In the third, a corpus of video clips is paired with sentences that describe some of the events in those clips, and the meanings of the words in those sentences are learned. We learn these meanings without needing to specify which attribute of the video clips each word in a given sentence refers to. The learned meaning representations are shown to be intelligible to humans.
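The compositional idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract does not specify the form of their models. Everything here is an assumption made for illustration: the track data, the hand-written word models `noun` and `approached`, the hand-written predicate-argument structure in `SENTENCE`, and the choice to combine word scores by a simple product. The sketch scores each assignment of sentence participants to video tracks and picks the best one, loosely mirroring the first task (highlighting the participants a sentence describes when similar objects are present).

```python
from itertools import permutations
from math import prod

# Invented toy "tracks": an object label and a 1-D position per frame.
TRACKS = {
    "t0": {"kind": "person", "x": [0.0, 1.0, 2.0, 3.0]},
    "t1": {"kind": "ball", "x": [4.0, 4.0, 4.0, 4.0]},
    "t2": {"kind": "person", "x": [9.0, 9.0, 9.0, 9.0]},
}


def noun(kind):
    """Hypothetical one-argument word model: prefers tracks of the given kind."""
    return lambda t: 1.0 if t["kind"] == kind else 0.1


def approached(a, b):
    """Hypothetical two-argument word model: high when the gap between a and b shrinks."""
    start = abs(a["x"][0] - b["x"][0])
    end = abs(a["x"][-1] - b["x"][-1])
    return 1.0 if end < start else 0.1


# Hand-written predicate-argument structure for
# "The person approached the ball": participant 0 is the person,
# participant 1 is the ball, and the verb links the two.
SENTENCE = [
    (noun("person"), (0,)),
    (noun("ball"), (1,)),
    (approached, (0, 1)),
]


def sentence_score(sentence, assignment):
    """Multiply the word scores, applying each word to its assigned tracks."""
    return prod(
        word(*(TRACKS[assignment[i]] for i in args)) for word, args in sentence
    )


def best_assignment(sentence, n_participants):
    """Search all assignments of distinct tracks to participants."""
    candidates = permutations(TRACKS, n_participants)
    return max(candidates, key=lambda a: sentence_score(sentence, a))


print(best_assignment(SENTENCE, 2))
```

In this toy data, the second person (`t2`) stands still, so the verb's score separates the two otherwise similar candidates. The exhaustive search over assignments is only practical for tiny examples.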
Apr-30-2015
- Country:
- North America > United States
- Texas > Travis County
- Austin (0.04)
- Indiana > Tippecanoe County
- West Lafayette (0.04)
- Lafayette (0.04)
- Genre:
- Research Report (0.67)
- Overview (0.67)
- Industry:
- Leisure & Entertainment > Sports (1.00)
- Government (1.00)
- Media (0.67)
- Technology:
- Information Technology
- Data Science > Data Mining (0.92)
- Sensing and Signal Processing (0.92)
- Artificial Intelligence
- Vision (1.00)
- Representation & Reasoning > Uncertainty (0.92)
- Natural Language
- Text Processing (1.00)
- Grammars & Parsing (1.00)
- Machine Learning
- Statistical Learning (1.00)
- Performance Analysis > Accuracy (1.00)
- Learning Graphical Models > Undirected Networks
- Markov Models (1.00)