COBE: Contextualized Object Embeddings from Narrated Instructional Video
Supplementary Materials

Neural Information Processing Systems 

Our supplementary materials consist of:

1. Implementation Details. We train our model for 10 epochs with an initial learning rate of 0.001, a linear warmup of 500 steps, and a momentum of 0.9. We use multi-scale training, randomly resizing the shorter side of each frame to between 400 and 800 pixels. The model is trained in a distributed setting on 64 GPUs, with each GPU holding a single frame. We initialize the model from a Faster R-CNN pretrained on COCO for object detection.
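The warmup schedule and the multi-scale resizing described above can be illustrated with a minimal sketch. It is not the released training code. The function names `warmup_lr` and `sample_frame_size` are hypothetical, and several details are assumptions that the text does not specify: the warmup starts at `base_lr / warmup_steps`, the learning rate stays constant after warmup, the target side length is drawn uniformly from the integers in [400, 800], and no cap is placed on the longer side.

```python
import random

BASE_LR = 0.001                 # initial learning rate stated in the text
WARMUP_STEPS = 500              # linear warmup length stated in the text
MIN_SIDE, MAX_SIDE = 400, 800   # shorter-side range stated in the text


def warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at the 0-indexed optimizer step `step`.

    Assumption: the rate ramps linearly from base_lr / warmup_steps up to
    base_lr and is held constant afterwards.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


def sample_frame_size(height, width, rng=random):
    """Draw a random shorter-side length and return the resized (height, width).

    The aspect ratio is preserved up to rounding to whole pixels.
    """
    target = rng.randint(MIN_SIDE, MAX_SIDE)  # inclusive on both ends
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)
```

With one frame per GPU on 64 GPUs, each optimizer step would see 64 frames if gradients are synchronized across all devices, which is an assumption here. Each of those frames would call `sample_frame_size` independently, so frames within one step can have different resolutions.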