Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck

Open in new window