Video Token Merging for Long-form Video Understanding