Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
