We present a simple method to reconstruct a high-resolution video from a face-video, where the identity of a person is obscured by pixelization. This concealment method is popular because the viewer can still perceive a human face figure and the overall head motion. However, we show in our experiments that a fairly good approximation of the original video can be reconstructed in a way that compromises anonymity. Our system exploits the simultaneous similarity and small disparity between close-by video frames depicting a human face, and employs a spatial transformation component that learns the alignment between the pixelated frames. Each frame, supported by its aligned surrounding frames, is first encoded, then decoded to a higher resolution. Reconstruction and perceptual losses promote adherence to the ground-truth, and an adversarial loss assists in maintaining domain faithfulness. There is no need for explicit temporal coherency loss as it is maintained implicitly by the alignment of neighboring frames and reconstruction. Although simple, our framework synthesizes high-quality face reconstructions, demonstrating that given the statistical prior of a human face, multiple aligned pixelated frames contain sufficient information to reconstruct a high-quality approximation of the original signal.
As the foundation of driverless vehicle and intelligent robots, Simultaneous Localization and Mapping(SLAM) has attracted much attention these days. However, non-geometric modules of traditional SLAM algorithms are limited by data association tasks and have become a bottleneck preventing the development of SLAM. To deal with such problems, many researchers seek to Deep Learning for help. But most of these studies are limited to virtual datasets or specific environments, and even sacrifice efficiency for accuracy. Thus, they are not practical enough. We propose DF-SLAM system that uses deep local feature descriptors obtained by the neural network as a substitute for traditional hand-made features. Experimental results demonstrate its improvements in efficiency and stability. DF-SLAM outperforms popular traditional SLAM systems in various scenes, including challenging scenes with intense illumination changes. Its versatility and mobility fit well into the need for exploring new environments. Since we adopt a shallow network to extract local descriptors and remain others the same as original SLAM systems, our DF-SLAM can still run in real-time on GPU.
We consider the problem of relative pose regression in visual relocalization. Recently, several promising approaches have emerged in this area. We claim that even though they demonstrate on the same datasets using the same split to train and test, a faithful comparison between them was not available since on currently used evaluation metric, some approaches might perform favorably, while in reality performing worse. We reveal a tradeoff between accuracy and the 3D volume of the regressed subspace. We believe that unlike other relocalization approaches, in the case of relative pose regression, the regressed subspace 3D volume is less dependent on the scene and more affect by the method used to score the overlap, which determined how closely sampled viewpoints are. We propose three new metrics to remedy the issue mentioned above. The proposed metrics incorporate statistics about the regression subspace volume. We also propose a new pose regression network that serves as a new baseline for this task. We compare the performance of our trained model on Microsoft 7-Scenes and Cambridge Landmarks datasets both with the standard metrics and the newly proposed metrics and adjust the overlap score to reveal the tradeoff between the subspace and performance. The results show that the proposed metrics are more robust to different overlap threshold than the conventional approaches. Finally, we show that our network generalizes well, specifically, training on a single scene leads to little loss of performance on the other scenes.
Recent general AI conferences show a decline in both the number and the quality of vision papers, but there is tremendous growth in, and specialization of, computer vision conferences. Hence, one might conclude that computer vision is parting or has parted company with AI. This article proposes that the divorce of computer vision and AI suggested here is actually an open marriage: Although computer vision is developing through its own research agenda, there are many shared areas of interest, and many of the key goals, assumptions, and characteristics of computer vision are also clearly found in AI.
Current UAV-recorded datasets are mostly limited to action recognition and object tracking, whereas the gesture signals datasets were mostly recorded in indoor spaces. Currently, there is no outdoor recorded public video dataset for UAV commanding signals. Gesture signals can be effectively used with UAVs by leveraging the UAVs visual sensors and operational simplicity. To fill this gap and enable research in wider application areas, we present a UAV gesture signals dataset recorded in an outdoor setting. We selected 13 gestures suitable for basic UAV navigation and command from general aircraft handling and helicopter handling signals. We provide 119 high-definition video clips consisting of 37151 frames. The overall baseline gesture recognition performance computed using Pose-based Convolutional Neural Network (P-CNN) is 91.9 %. All the frames are annotated with body joints and gesture classes in order to extend the dataset's applicability to a wider research area including gesture recognition, action recognition, human pose recognition and situation awareness.