End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen 1, Ming-Hsuan Yang
University of California, Merced 2