AITopics | Zhu, Yukun

Neptune: The Long Orbit to Benchmarking Long Video Understanding

Nagrani, Arsha, Zhang, Mingda, Mehran, Ramin, Hornung, Rachel, Gundavarapu, Nitesh Bharadwaj, Jha, Nilpa, Myers, Austin, Zhou, Xingyi, Gong, Boqing, Schmid, Cordelia, Sirotenko, Mikhail, Zhu, Yukun, Weyand, Tobias

arXiv.org Artificial IntelligenceDec-12-2024

This paper describes a semi-automatic pipeline to generate challenging question-answer-decoy sets for understanding long videos. Many existing video datasets and models are focused on short clips (10s-30s). While some long video datasets do exist, they can often be solved by powerful image models applied per frame (and often to very few frames) in a video, and are usually manually annotated at high cost. In order to mitigate both these problems, we propose a scalable dataset creation pipeline which leverages large models (VLMs and LLMs), to automatically generate dense, time-aligned video captions, as well as tough question answer decoy sets for video segments (up to 15 minutes in length). Our dataset Neptune covers a broad range of long video reasoning abilities and consists of a subset that emphasizes multimodal reasoning. Since existing metrics for open-ended question answering are either rule-based or may rely on proprietary models, we provide a new open source model-based metric GEM to score open-ended responses on Neptune. Benchmark evaluations reveal that most current open-source long video models perform poorly on Neptune, particularly on questions testing temporal ordering, counting and state changes. Through Neptune, we aim to spur the development of more advanced models capable of understanding long videos. The dataset is available at https://github.com/google-deepmind/neptune

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.09582

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine (0.93)
Media > Film (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

Video Summarization: Towards Entity-Aware Captions

Ayyubi, Hammad A., Liu, Tianqi, Nagrani, Arsha, Lin, Xudong, Zhang, Mingda, Arnab, Anurag, Han, Feng, Zhu, Yukun, Liu, Jialu, Chang, Shih-Fu

arXiv.org Artificial IntelligenceDec-1-2023

Existing popular video captioning benchmarks and models deal with generic captions devoid of specific person, place or organization named entities. In contrast, news videos present a challenging setting where the caption requires such named entities for meaningful summarization. As such, we propose the task of summarizing news video directly to entity-aware captions. We also release a large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task. Further, we propose a method that augments visual information from videos with context retrieved from external world knowledge to generate entity-aware captions. We demonstrate the effectiveness of our approach on three video captioning models. We also show that our approach generalizes to existing news image captions dataset. With all the extensive experiments and insights, we believe we establish a solid basis for future research on this challenging task.

caption, large language model, machine learning, (21 more...)

arXiv.org Artificial Intelligence

2312.02188

Country:

Europe (1.00)
Asia > Middle East > Iraq (1.00)
Africa (0.93)
North America > United States > California > San Francisco County > San Francisco (0.14)

Genre: Research Report (1.00)

Industry:

Government > Regional Government > North America Government > United States Government (1.00)
Government > Military (1.00)
Leisure & Entertainment (0.93)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)

Add feedback

3D Object Proposals for Accurate Object Class Detection

Chen, Xiaozhi, Kundu, Kaustav, Zhu, Yukun, Berneshawi, Andrew G., Ma, Huimin, Fidler, Sanja, Urtasun, Raquel

Neural Information Processing SystemsFeb-14-2020, 06:10:38 GMT

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on all three KITTI object classes.

artificial intelligence, object-oriented architecture, proposal, (2 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Object-Oriented Architecture (0.68)
Information Technology > Artificial Intelligence > Machine Learning (0.57)

Add feedback

Panoptic-DeepLab

Cheng, Bowen, Collins, Maxwell D., Zhu, Yukun, Liu, Ting, Huang, Thomas S., Adam, Hartwig, Chen, Liang-Chieh

arXiv.org Machine LearningOct-23-2019

Our Panoptic-DeepLab is conceptually simple and delivers state-of-the-art results. In particular, we adopt the dual-ASPP and dual-decoder structures specific to semantic, and instance segmentation, respectively. The semantic segmentation branch is the same as the typical design of any semantic segmentation model ( e.g., DeepLab), while the instance segmentation branch is class-agnostic, involving a simple instance center regression. Our single Panoptic-DeepLab sets the new state-of-art at all three Cityscapes benchmarks, reaching 84.2% mIoU, 39.0% AP, and 65.5% PQ on test set, and advances results on the other challenging Mapillary Vistas. 1. Introduction Our bottom-up Panoptic-DeepLab is conceptually simple and delivers state-of-the-art panoptic segmentation results [7]. We adopt dual-ASPP and dual-decoder modules, specific to semantic segmentation and instance segmentation, respectively. The semantic segmentation branch follows the typical design of any semantic segmentation model (e.g., DeepLab [2]), while the instance segmentation prediction involves a simple instance center regression [1, 5], where the model learns to predict instance centers as well as the offset from each pixel to its corresponding center.

artificial intelligence, machine learning, segmentation, (16 more...)

arXiv.org Machine Learning

1910.04751

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

Chen, Liang-Chieh, Collins, Maxwell, Zhu, Yukun, Papandreou, George, Zoph, Barret, Schroff, Florian, Adam, Hartwig, Shlens, Jon

Neural Information Processing SystemsDec-31-2018

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.

artificial intelligence, machine learning, survey article, (18 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Genre:

Overview (0.54)
Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

Chen, Liang-Chieh, Collins, Maxwell, Zhu, Yukun, Papandreou, George, Zoph, Barret, Schroff, Florian, Adam, Hartwig, Shlens, Jon

Neural Information Processing SystemsDec-31-2018

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.

architecture, deep learning, neural network, (18 more...)

Neural Information Processing Systems

Country: North America > Canada (0.14)

Genre:

Overview (0.54)
Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Searching for Efficient Multi-Scale Architectures for Dense Image Prediction

Chen, Liang-Chieh, Collins, Maxwell D., Zhu, Yukun, Papandreou, George, Zoph, Barret, Schroff, Florian, Adam, Hartwig, Shlens, Jonathon

arXiv.org Machine LearningSep-11-2018

The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7\% on Cityscapes (street scene parsing), 71.3\% on PASCAL-Person-Part (person-part segmentation), and 87.9\% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.

architecture, deep learning, neural network, (18 more...)

arXiv.org Machine Learning

1809.04184

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Skip-Thought Vectors

Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R., Zemel, Richard, Urtasun, Raquel, Torralba, Antonio, Fidler, Sanja

Neural Information Processing SystemsDec-31-2015

We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.

deep learning, neural network, vector, (22 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Add feedback

3D Object Proposals for Accurate Object Class Detection

Chen, Xiaozhi, Kundu, Kaustav, Zhu, Yukun, Berneshawi, Andrew G., Ma, Huimin, Fidler, Sanja, Urtasun, Raquel

Neural Information Processing SystemsDec-31-2015

The goal of this paper is to generate high-quality 3D object proposals in the context of autonomous driving. Our method exploits stereo imagery to place proposals in the form of 3D bounding boxes. We formulate the problem as minimizing an energy function encoding object size priors, ground plane as well as several depth informed features that reason about free space, point cloud densities and distance to the ground. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. Combined with convolutional neural net (CNN) scoring, our approach outperforms all existing results on all three KITTI object classes.

artificial intelligence, neural network, proposal, (16 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.68)

Industry: