End-to-end Multi-modal Video Temporal Grounding
Yi-Wen Chen, Ming-Hsuan Yang
University of California, Merced
Neural Information Processing Systems
In this supplementary document, we provide additional analysis and experimental results, including 1) more implementation details, 2) results of single-stream and two-stream models, and 3) more qualitative results for text-guided video temporal grounding.

The co-attentional transformer layers in our model have hidden states of size 1024 and 8 attention heads. For intra-modality contrastive learning, we randomly sample 3 positive videos that contain the same action as the anchor video and 4 negative videos with different action categories. Our framework is implemented on a machine with an Intel Xeon 2.3 GHz processor and an NVIDIA GTX 1080 Ti GPU with 11 GB of memory.

In Table 1, we present the results of single-stream models using optical flow or depth as the visual input, and two-stream models using depth and RGB or depth and optical flow as the two modalities.
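To make the stated configuration concrete, below is a minimal sketch of one co-attentional transformer layer in the style described (hidden size 1024, 8 attention heads), where each modality stream attends to the other stream's hidden states. This is an illustrative PyTorch approximation, not the authors' implementation; the class name `CoAttentionLayer` and the residual/normalization placement are our assumptions.

```python
import torch
import torch.nn as nn


class CoAttentionLayer(nn.Module):
    """Illustrative co-attentional layer: two streams cross-attend to each other.

    hidden=1024 and heads=8 follow the configuration stated in the text;
    everything else (residuals, LayerNorm placement) is assumed.
    """

    def __init__(self, hidden: int = 1024, heads: int = 8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(hidden)
        self.norm_b = nn.LayerNorm(hidden)

    def forward(self, a: torch.Tensor, b: torch.Tensor):
        # Stream A queries stream B's keys/values, and vice versa.
        a2, _ = self.attn_a(a, b, b)
        b2, _ = self.attn_b(b, a, a)
        return self.norm_a(a + a2), self.norm_b(b + b2)
```

A full model would stack several such layers and add feed-forward sublayers, but the cross-attention pattern above is the core of the co-attentional design.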
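The positive/negative sampling for intra-modality contrastive learning can be sketched as follows. This is a hedged illustration, assuming each video carries a single action-category label; the function name `sample_contrastive_videos` and the label-dictionary interface are ours, not from the paper.

```python
import random


def sample_contrastive_videos(anchor_id, labels, n_pos=3, n_neg=4, rng=None):
    """Sample videos for intra-modality contrastive learning.

    labels: dict mapping video_id -> action category.
    Returns n_pos videos sharing the anchor's action (positives, as in
    the paper: 3 by default) and n_neg videos with a different action
    (negatives, 4 by default).
    """
    rng = rng or random.Random()
    anchor_action = labels[anchor_id]
    positives = [v for v, a in labels.items()
                 if a == anchor_action and v != anchor_id]
    negatives = [v for v, a in labels.items() if a != anchor_action]
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)
```

The sampled positives and negatives would then feed a standard contrastive objective that pulls same-action representations together and pushes different-action representations apart.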