Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

Open in new window