Artemis: Towards Referential Understanding in Complex Videos

Neural Information Processing Systems 

Videos carry rich visual information, including object descriptions, actions, and interactions, but existing multimodal large language models (MLLMs) fall short