See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Neural Information Processing Systems

We introduce See&Trek, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMs) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visual-spatial understanding remains underexplored.

