Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs

Open in new window