Video Token Merging for Long Video Understanding