A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era
Neural Information Processing Systems
Prior research in human-centric AI has primarily addressed single-modality tasks like pedestrian detection, action recognition, and pose estimation. However, the emergence of large multimodal models (LMMs) such as GPT-4V has redirected attention towards integrating language with visual content. Referring expression comprehension (REC) represents a prime example of this multimodal approach.
May-30-2025, 12:13:31 GMT