NanoVLMs: How small can we go and still make coherent Vision Language Models?
