Multi-modal Situated Reasoning in 3D Scenes

Neural Information Processing Systems

Comprehensive evaluations on MSQA and MSNN highlight the limitations of existing vision-language models and underscore the importance of handling multi-modal interleaved inputs and situation modeling.