Review for NeurIPS paper: COBE: Contextualized Object Embeddings from Narrated Instructional Video
–Neural Information Processing Systems
While this algorithm is specifically designed for detectors, Miech et al. (2019) used unsupervised NCE losses (much like the ones in this paper) to understand the natural language descriptions associated with videos; the algorithm presented here seems like the most straightforward extension of this idea to bounding boxes. Little attention is given to demonstrating that the use of bounding boxes fundamentally changes the problem.

Update: The rebuttal addresses my point regarding the accuracy of the evaluation. I had misunderstood the annotations that are available with EPIC-Kitchens, and I am therefore changing my review. I would encourage the authors to clarify the writing regarding what is available with EPIC-Kitchens.
Jan-27-2025, 13:40:42 GMT