Zhang, Sheng (South China University of Technology) | Liu, Yuliang (South China University of Technology) | Jin, Lianwen (South China University of Technology) | Luo, Canjie (South China University of Technology)
In this paper, we propose a refined scene text detector with a novel Feature Enhancement Network (FEN) for region proposal and text detection refinement. Retrospectively, both region proposal with only 3 x 3 sliding-window features and text detection refinement with single-scale high-level features are insufficient, especially for smaller scene text. Therefore, we design a new FEN that fuses task-specific low- and high-level semantic features to improve the performance of text detection. Besides, since unitary position-sensitive RoI pooling in general object detection is unreasonable for variable text regions, an adaptively weighted position-sensitive RoI pooling layer is devised to further enhance detection accuracy. To tackle the sample-imbalance problem during the refinement stage, we also propose an effective positives mining strategy for efficiently training our network. Experiments on the ICDAR 2011 and 2013 robust text detection benchmarks demonstrate that our method can achieve state-of-the-art results, outperforming all reported methods in terms of F-measure.
Song, Jingkuan (University of Electronic Science and Technology of China) | He, Tao (University of Electronic Science and Technology of China) | Gao, Lianli (University of Electronic Science and Technology of China) | Xu, Xing (University of Electronic Science and Technology of China) | Shen, Heng Tao (University of Electronic Science and Technology of China)
Instance Search (INS) is a fundamental problem for many applications, yet it is more challenging than traditional image search because relevancy is defined at the instance level. Existing works have demonstrated the success of many complex ensemble systems that typically first generate object proposals and then extract handcrafted and/or CNN features of each proposal for matching. However, object bounding box proposals and feature extraction are often conducted in two separate steps, which limits the effectiveness of these methods. Also, due to the large number of generated proposals, matching speed becomes the bottleneck that limits their application to large-scale datasets. To tackle these issues, in this paper we propose an effective and efficient Deep Region Hashing (DRH) approach for large-scale INS using an image patch as the query. Specifically, DRH is an end-to-end deep neural network which consists of object proposal, feature extraction, and hash code generation. DRH shares the full-image convolutional feature map with the region proposal network, thus enabling nearly cost-free region proposals. Also, each high-dimensional, real-valued region feature is mapped onto a low-dimensional, compact binary code for efficient object-region-level matching on large-scale datasets. Experimental results on four datasets show that our DRH can achieve even better performance than the state of the art in terms of mAP, while the efficiency is improved by nearly 100 times.
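The matching step described in this abstract, mapping real-valued region features to compact binary codes and comparing them by Hamming distance, can be illustrated with a minimal sketch. This is not the DRH network: the random `projection` matrix is a hypothetical stand-in for a learned hash layer, and the feature dimension, code length, and database size are arbitrary assumptions chosen for illustration.

```python
import numpy as np


def to_binary_codes(features, projection):
    """Project real-valued features and threshold at zero to get 0/1 codes."""
    return (features @ projection > 0).astype(np.uint8)


def hamming_distances(query_code, database_codes):
    """Count differing bits between one query code and every database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)


rng = np.random.default_rng(0)
dim, bits, n_regions = 512, 64, 1000

# Hypothetical stand-in for a learned hash layer (assumption, not from the paper).
projection = rng.standard_normal((dim, bits))

# Synthetic "region features" playing the role of the database.
region_features = rng.standard_normal((n_regions, dim))
database_codes = to_binary_codes(region_features, projection)

# Query: a slightly perturbed copy of region 42, mimicking a query patch.
query_feature = region_features[42] + 0.01 * rng.standard_normal(dim)
query_code = to_binary_codes(query_feature[None, :], projection)[0]

distances = hamming_distances(query_code, database_codes)
best_match = int(np.argmin(distances))
```

Comparing 64-bit codes only requires counting differing bits, which is far cheaper than computing distances between 512-dimensional real-valued vectors; this is the general reason binary codes help with large proposal sets.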
A few weeks back we wrote a post on Object detection using YOLOv3. The output of an object detector is an array of bounding boxes around objects detected in the image or video frame, but we do not get any clue about the shape of the object inside the bounding box. Wouldn't it be cool if we could find a binary mask containing the object instead of just the bounding box? In this post, we will learn how to do just that. We will show how to use a Convolutional Neural Network (CNN) model called Mask-RCNN (Region based Convolutional Neural Network) for object detection and segmentation.
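To make the difference between a bounding box and a binary mask concrete, here is a small NumPy sketch. It does not run Mask-RCNN; the disc-shaped `object_mask` is synthetic, and the helper names `box_to_mask` and `mask_to_box` are made up for this illustration.

```python
import numpy as np


def box_to_mask(box, height, width):
    """Rasterize a box (x1, y1, x2, y2), with exclusive x2/y2, into a boolean mask."""
    x1, y1, x2, y2 = box
    mask = np.zeros((height, width), dtype=bool)
    mask[y1:y2, x1:x2] = True
    return mask


def mask_to_box(mask):
    """Return the tightest box (x1, y1, x2, y2) that contains all True pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1


height, width = 8, 8
yy, xx = np.mgrid[:height, :width]

# Synthetic object: a disc of radius 3 centered at (4, 4).
object_mask = (yy - 4) ** 2 + (xx - 4) ** 2 <= 9

box = mask_to_box(object_mask)
box_mask = box_to_mask(box, height, width)

# Fraction of the bounding box that the object actually covers.
fill_ratio = object_mask.sum() / box_mask.sum()
```

The box covers every pixel of the object but also includes background pixels in its corners, so `fill_ratio` is below 1. The mask carries the shape information that the box alone cannot.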
Recognizing multiple labels of images is a fundamental but challenging task in computer vision, and remarkable progress has been attained by localizing semantic-aware image regions and predicting their labels with deep convolutional neural networks. The hypothesis region (region proposal) localization step in these existing multi-label image recognition pipelines, however, usually incurs redundant computation cost, e.g., generating hundreds of meaningless proposals with non-discriminative information and extracting their features, and the modeling of spatial contextual dependencies among the localized regions is often ignored or over-simplified. To resolve these issues, this paper proposes a recurrent attention reinforcement learning framework to iteratively discover a sequence of attentional and informative regions that are related to different semantic objects and further predict label scores conditioned on these regions. Besides, our method explicitly models long-term dependencies among these attentional regions, which helps to capture semantic label co-occurrence and thus facilitates multi-label recognition. Extensive experiments and comparisons on two large-scale benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model is superior to existing state-of-the-art methods in both recognition performance and efficiency, and that it explicitly associates image-level semantic labels with specific object regions.