Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models