Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
Neural Information Processing Systems
[Figure: example queries]
- Where does 'A man is walking in a narrow alley, with street noise and conversations in the background. ...
- Locate the moment where "A man wearing a white mask is speaking ...
- For the query 'A man recommends visiting local areas in Tokyo, filming the ...
- Determine the precise timestamp in ...
Jun-22-2026, 10:27:57 GMT
- Genre:
  - Research Report > Experimental Study (1.00)
- Industry:
  - Education (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Vision (1.00)
    - Speech (1.00)
    - Natural Language > Large Language Model (1.00)
    - Machine Learning (1.00)