Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Open in new window