Scaling Capability in Token Space: An Analysis of Large Vision Language Model