AGQA: A Benchmark for Compositional, Spatio-Temporal Reasoning
Take a look at the video above and the associated question: "What did they hold before opening the closet?" After watching the video, you can easily answer that the person was holding a phone. People have a remarkable ability to comprehend visual events in new videos and to answer questions about them. In this video, the person initially holds a phone, then opens the closet and takes out a picture. To answer the question, we need to recognize the action "opening the closet" and then understand how "before" restricts our search for the answer to events that occur before this action.
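The two reasoning steps described here (locating the anchor action, then letting "before" restrict the search) can be illustrated with a minimal Python sketch. The `Event` record, the timestamps, and the `objects_before` helper are hypothetical and made up for this example. They are not the benchmark's actual data format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical annotation: an action on an object over a time span (seconds)."""
    start: float
    end: float
    verb: str
    obj: str


# Made-up timeline for the example video; the timestamps are illustrative only.
events = [
    Event(0.0, 6.0, "holding", "phone"),
    Event(6.0, 9.0, "opening", "closet"),
    Event(9.0, 12.0, "taking", "picture"),
]


def objects_before(events, verb, anchor_verb, anchor_obj):
    """Return objects of `verb` events that start before the anchor action starts."""
    # Step 1: recognize the anchor action, e.g. "opening the closet".
    anchor = next(
        e for e in events if e.verb == anchor_verb and e.obj == anchor_obj
    )
    # Step 2: "before" keeps only events that began earlier than the anchor.
    return [e.obj for e in events if e.verb == verb and e.start < anchor.start]


answer = objects_before(events, "holding", "opening", "closet")
print(answer)
```

Comparing start times is one possible reading of "before"; a stricter reading would require the earlier event to end before the anchor begins.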
Jun-21-2021, 22:03:10 GMT