 iou


Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM

Neural Information Processing Systems

[Figure: example queries, including 'A man is walking in a narrow alley, with street noise and conversations in the background', "A man wearing a white mask is speaking ...", and 'A man recommends visiting local areas in Tokyo, filming the ...']


MedSG-Bench: A Benchmark for Medical Image Sequences Grounding

Neural Information Processing Systems

Visual grounding is essential for precise perception and reasoning in multimodal large language models (MLLMs), especially in medical imaging domains. While existing medical visual grounding benchmarks primarily focus on single-image scenarios, real-world clinical applications often involve sequential images, where accurate lesion localization across different modalities and temporal tracking of disease progression (e.g., pre- vs. ...