Inference Optimal VLMs Need Only One Visual Token but Larger Models