Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study

Open in new window