Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Zhang, Yuting; Yuan, Luyao; Guo, Yijie; He, Zhiyuan; Huang, I-An; Lee, Honglak

Apr-17-2017–arXiv.org Machine Learning

Associating image regions with text queries has recently been explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.
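The pairing of image regions with text phrases into positive and negative examples can be illustrated with a minimal sketch. The abstract does not give the pairing rule or the loss, so everything here is an assumption made for illustration: the IoU threshold `pos_iou=0.5`, the helper names `iou`, `label_pairs`, and `binary_log_loss`, and the simplification that any region-phrase pair without an overlapping annotation counts as negative.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_pairs(regions, annotations, phrases, pos_iou=0.5):
    """Build (region, phrase, label) triples.

    annotations: list of (box, phrase) ground-truth pairs.
    A pair is labeled 1 if the region overlaps a ground-truth box
    for that phrase with IoU >= pos_iou (assumed threshold), else 0.
    Treating every other pair as negative is a simplification.
    """
    pairs = []
    for region in regions:
        for phrase in phrases:
            gt_boxes = [box for box, p in annotations if p == phrase]
            positive = any(iou(region, b) >= pos_iou for b in gt_boxes)
            pairs.append((region, phrase, 1 if positive else 0))
    return pairs


def binary_log_loss(scores, labels):
    """Mean logistic loss over raw scores (logits), computed stably."""
    total = 0.0
    for s, y in zip(scores, labels):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


# Toy data: two candidate regions, one annotated phrase, one distractor phrase.
regions = [(0, 0, 10, 10), (20, 20, 30, 30)]
annotations = [((0, 0, 10, 10), "red car")]
phrases = ["red car", "tennis player"]

pairs = label_pairs(regions, annotations, phrases)
labels = [label for _, _, label in pairs]
# Placeholder scores of zero stand in for the output of a bimodal network.
loss = binary_log_loss([0.0] * len(pairs), labels)
```

In this toy setup only one of the four region-phrase pairs is positive, which hints at why a discriminative objective of this kind would lean heavily on negative samples.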

artificial intelligence, machine learning, natural language

Country:
- North America > United States > Michigan (0.28)

Genre:
- Research Report > Promising Solution (0.54)

Industry:
- Transportation > Ground
  - Road (0.67)
- Leisure & Entertainment > Sports
  - Tennis (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning
    - Neural Networks > Deep Learning (0.93)
    - Statistical Learning (0.93)
