The significant power of deep learning networks has led to enormous development in object detection. Over the last few years, object detector frameworks have achieved tremendous success in both accuracy and efficiency. However, their ability is far from that of human beings due to several factors, occlusion being one of them. Since occlusion can happen in various locations, scale, and ratio, it is very difficult to handle. In this paper, we address the challenges in occlusion handling in generic object detection in both outdoor and indoor scenes, then we refer to the recent works that have been carried out to overcome these challenges. Finally, we discuss some possible future directions of research.
Training a deep neural model for semantic segmentation requires collecting a large amount of pixel-level labeled data. To alleviate the data scarcity problem presented in the real world, one could utilize synthetic data whose label is easy to obtain. Previous work has shown that the performance of a semantic segmentation model can be improved by training jointly with real and synthetic examples with a proper weighting on the synthetic data. Such weighting was learned by a heuristic to maximize the similarity between synthetic and real examples. In our work, we instead learn a pixel-level weighting of the synthetic data by meta-learning, i.e., the learning of weighting should only be minimizing the loss on the target task. We achieve this by gradient-on-gradient technique to propagate the target loss back into the parameters of the weighting model. The experiments show that our method with only one single meta module can outperform a complicated combination of an adversarial feature alignment, a reconstruction loss, plus a hierarchical heuristic weighting at pixel, region and image levels.
Signal capture stands in the forefront to perceive and understand the environment and thus imaging plays the pivotal role in mobile vision. Recent explosive progresses in Artificial Intelligence (AI) have shown great potential to develop advanced mobile platforms with new imaging devices. Traditional imaging systems based on the "capturing images first and processing afterwards" mechanism cannot meet this unprecedented demand. Differently, Computational Imaging (CI) systems are designed to capture high-dimensional data in an encoded manner to provide more information for mobile vision systems.Thanks to AI, CI can now be used in real systems by integrating deep learning algorithms into the mobile vision platform to achieve the closed loop of intelligent acquisition, processing and decision making, thus leading to the next revolution of mobile vision.Starting from the history of mobile vision using digital cameras, this work first introduces the advances of CI in diverse applications and then conducts a comprehensive review of current research topics combining CI and AI. Motivated by the fact that most existing studies only loosely connect CI and AI (usually using AI to improve the performance of CI and only limited works have deeply connected them), in this work, we propose a framework to deeply integrate CI and AI by using the example of self-driving vehicles with high-speed communication, edge computing and traffic planning. Finally, we outlook the future of CI plus AI by investigating new materials, brain science and new computing techniques to shed light on new directions of mobile vision systems.
Tracking by natural language specification is a new rising research topic that aims at locating the target object in the video sequence based on its language description. Compared with traditional bounding box (BBox) based tracking, this setting guides object tracking with high-level semantic information, addresses the ambiguity of BBox, and links local and global search organically together. Those benefits may bring more flexible, robust and accurate tracking performance in practical scenarios. However, existing natural language initialized trackers are developed and compared on benchmark datasets proposed for tracking-by-BBox, which can't reflect the true power of tracking-by-language. In this work, we propose a new benchmark specifically dedicated to the tracking-by-language, including a large scale dataset, strong and diverse baseline methods. Specifically, we collect 2k video sequences (contains a total of 1,244,340 frames, 663 words) and split 1300/700 for the train/testing respectively. We densely annotate one sentence in English and corresponding bounding boxes of the target object for each video. We also introduce two new challenges into TNL2K for the object tracking task, i.e., adversarial samples and modality switch. A strong baseline method based on an adaptive local-global-search scheme is proposed for future works to compare. We believe this benchmark will greatly boost related researches on natural language guided tracking.
In this paper, we introduce a new dataset, named InstaOrder, that can be used to understand the spatial relationships of instances in a 3D space. The dataset consists of 2.9M annotations of geometric orderings for class-labeled instances in 101K natural scenes. The scenes were annotated by 3,659 crowd-workers regarding (1) occlusion order that identifies occluder/occludee and (2) depth order that describes ordinal relations that consider relative distance from the camera. The dataset provides joint annotation of two kinds of orderings for the same instances, and we discover that the occlusion order and depth order are complementary. We also introduce a geometric order prediction network called InstaOrderNet, which is superior to state-of-the-art approaches. Moreover, we propose InstaDepthNet that uses auxiliary geometric order loss to boost the instance-wise depth prediction accuracy of MiDaS. These contributions to geometric scene understanding will help to improve the accuracy of various computer vision tasks.