Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding